The goal of binary classification is to estimate a discriminant
function $\gamma$ from observations of covariate vectors and corresponding
binary labels. We consider an elaboration of this problem in which
the covariates are not available directly but are transformed by a
dimensionality-reducing quantizer $Q$. We present conditions on loss
functions such that empirical risk minimization yields Bayes consistency
when both the discriminant function and the quantizer are estimated.
These conditions are stated in terms of a general correspondence
between loss functions and a class of functionals known as Ali-Silvey
or $f$-divergence functionals. Whereas this correspondence was
established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1
(1951) 93-102. Univ. California Press, Berkeley] for the 0-1 loss, we
extend the correspondence to the broader class of surrogate loss
functions that play a key role in the general theory of Bayes consistency
for binary classification. Our result makes it possible to pick out the
(strict) subset of surrogate loss functions that yield Bayes consistency
for joint estimation of the discriminant function and the quantizer.

1. Introduction. Consider the classical problem of binary classification:
given a pair of random variables $(X, Y) \in (\mathcal{X}, \mathcal{Y})$,
where $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$ and
$\mathcal{Y} = \{-1, +1\}$, and given a set of samples
$\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, the goal is to estimate a
discriminant function that predicts the binary label $Y$ given the
covariate vector $X$. The accuracy of any discriminant function
is generally assessed in terms of 0-1 loss as follows. Letting $P$ denote
the distribution of $(X, Y)$, and letting $\gamma : \mathcal{X} \to \mathbb{R}$
denote a given discriminant function, we seek to minimize the expectation
of the 0-1 loss; that is, the error probability
$P(Y \neq \operatorname{sign}(\gamma(X)))$.$^2$ Unfortunately, the 0-1 loss
is a nonconvex function, and practical classification algorithms, such as
boosting and the support vector machine, are based on relaxing the 0-1
loss to a convex upper bound or approximation, yielding a surrogate loss
function to which empirical risk minimization procedures can be applied.
A significant achievement of the recent literature on binary
classification has been the delineation of necessary and sufficient
conditions under which such relaxations yield Bayes consistency
[2, 9, 12, 13, 19, 22].

In many practical applications, this classical formulation of binary clas- 
sification is elaborated to include an additional stage of "feature selection" 
or "dimension reduction," in which the covariate vector X is transformed 
into a vector Z according to a data-dependent mapping Q. An interesting 
example of this more elaborate formulation is a "distributed detection"
problem, in which individual components of the $d$-dimensional covariate vector
are measured at spatially separated locations, and there are communication 
constraints that limit the rate at which the measurements can be forwarded 
to a central location where the classification decision is made [21]. This 
communication-constrained setting imposes severe constraints on the choice 
of Q: any mapping Q must be a separable function, specified by a collec- 
tion of d univariate, discrete-valued functions that are applied component- 
wise to X. The goal of decentralized detection is to specify and analyze 
data-dependent procedures for choosing such functions, which are typically 
referred to as "quantizers." More generally, we may abstract the essential 
ingredients of this problem and consider a problem of experimental design, 
in which $Q$ is taken to be a possibly stochastic mapping
$\mathcal{X} \to \mathcal{Z}$, chosen from some constrained class
$\mathcal{Q}$ of possible quantizers. In this setting, the discriminant
function is a mapping $\gamma : \mathcal{Z} \to \mathbb{R}$, chosen from
the class $\Gamma$ of all measurable functions on $\mathcal{Z}$. Overall,
the problem is to simultaneously determine both the mapping $Q$ and the
discriminant function $\gamma$, using the data
$\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, so as to jointly minimize the Bayes
error $R_{\mathrm{Bayes}}(\gamma, Q) := P(Y \neq \operatorname{sign}(\gamma(Z)))$.

As alluded to above, when Q is fixed, it is possible to give general con- 
ditions under which relaxations of 0-1 loss yield Bayes consistency. As we 
will show in the current paper, however, these conditions no longer suffice 
to yield consistency in the more general setting, in which the choice of Q 
is also optimized. Rather, in the setting of jointly estimating the
discriminant function $\gamma$ and optimizing the quantizer $Q$, new
conditions need to be imposed. It is the goal of the current paper to
present such conditions and, moreover, to provide a general theoretical
understanding of their origin. Such an understanding turns out to rest
not only on analytic properties of surrogate loss functions (as in the
$Q$-fixed case), but on a relationship between the family of surrogate
loss functions and another class of functions known as $f$-divergences
[1, 7]. In rough terms, an $f$-divergence between two distributions is
defined by the expectation of a convex function of their likelihood
ratio. Examples include the Hellinger distance, the total variational
distance, Kullback-Leibler divergence and Chernoff distance, as well as
various other divergences popular in the information theory literature
[20]. In our setting, these $f$-divergences are applied to the
class-conditional distributions induced by applying a fixed quantizer $Q$.

$^2$ We use the convention that $\operatorname{sign}(\alpha) = 1$ if $\alpha > 0$ and $-1$ otherwise.

An early hint of the relationship between surrogate losses and
$f$-divergences can be found in a seminal paper of Blackwell [3]. In our
language, Blackwell's result can be stated in the following way: if a
quantizer $Q_A$ induces class-conditional distributions whose
$f$-divergence is greater than the $f$-divergence induced by a quantizer
$Q_B$, then there exists some set of prior probabilities for the class
labels such that $Q_A$ results in a smaller probability of error than
$Q_B$. This result suggests that any analysis of quantization
procedures based on 0-1 and surrogate loss functions might usefully attempt
to relate surrogate loss functions to $f$-divergences. Our analysis shows
that this is indeed a fruitful suggestion, and that Blackwell's idea takes
its most powerful form when we move beyond 0-1 loss to consider the full
set of surrogate loss functions studied in the recent binary
classification literature.

Blackwell's result [3] has had significant historical impact on the signal 
processing literature (and thence on the distributed detection literature). 
Consider, in a manner complementary to the standard binary classification 
setting in which the quantizer Q is assumed known, the setting in which the 
discriminant function $\gamma$ is assumed known and only the quantizer $Q$
is to be estimated. This is a standard problem in the signal processing
literature (see, e.g., [10, 11, 17]), and solution strategies typically
involve the selection of a specific $f$-divergence to be optimized.
Typically, the choice of an $f$-divergence is made somewhat heuristically,
on the grounds of analytic convenience, computational convenience or
asymptotic arguments.

Our results in effect provide a broader and more rigorous framework for
justifying the use of various $f$-divergences in solving quantizer design
problems. We broaden the problem to consider the joint estimation of the
discriminant function and the quantizer. We adopt a decision-theoretic
perspective in which we aim to minimize the expectation of 0-1 loss, but
we relax to surrogate loss functions that are convex approximations of 0-1
loss, with the goal of obtaining computationally tractable minimization
procedures. By relating the family of surrogate loss functions to the
family of $f$-divergences, we are able to specify equivalence classes of
surrogate loss functions. The conditions that we present for Bayes
consistency are expressed in terms of these equivalence classes.

1.1. Our contributions. In order to state our contributions more
precisely, let us introduce some notation and definitions. Given the
distribution $P$ of the pair $(X, Y)$, consider a discrete space
$\mathcal{Z}$, and let $Q(z|x)$ denote a quantizer, that is, a conditional
probability distribution on $\mathcal{Z}$ for almost all $x$. Let $\mu$
and $\pi$ denote measures over $\mathcal{Z}$ that are induced by $Q$ as
follows:

(1a)  $\mu(z) := P(Y = 1, Z = z) = p \int_{\mathcal{X}} Q(z|x)\, dP(x \mid Y = 1)$,

(1b)  $\pi(z) := P(Y = -1, Z = z) = q \int_{\mathcal{X}} Q(z|x)\, dP(x \mid Y = -1)$,

where $p$ and $q$ denote the prior probabilities $p = P(Y = 1)$ and
$q = P(Y = -1)$. We assume that $Q$ is restricted to some constrained
class $\mathcal{Q}$, such that both $\mu$ and $\pi$ are strictly positive
measures.
An $f$-divergence is defined as

(2)  $I_f(\mu, \pi) := \sum_{z} \pi(z)\, f\!\left(\frac{\mu(z)}{\pi(z)}\right)$,

where $f : [0, +\infty) \to \mathbb{R} \cup \{+\infty\}$ is a continuous
convex function. Different choices of convex $f$ lead to different
divergence functionals [1, 7].
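
As a concrete illustration of definitions (1a), (1b) and (2), the following is a minimal sketch for a finite covariate space. The function names, the toy class-conditional distributions and the quantizer matrix are hypothetical and are not taken from the paper; the choice $f(u) = |u - 1|$ is used only because it yields $\sum_z |\mu(z) - \pi(z)|$ directly.

```python
def induced_measures(p_x_given_y, prior_pos, quantizer):
    """Return (mu, pi) as in (1a)-(1b) for a finite covariate space.

    p_x_given_y[y][x] is P(X = x | Y = y) for y in {+1, -1};
    quantizer[x][z] is Q(z | x); prior_pos is p = P(Y = 1).
    """
    n_x = len(quantizer)
    n_z = len(quantizer[0])
    prior = {+1: prior_pos, -1: 1.0 - prior_pos}

    def measure(label):
        return [prior[label] * sum(p_x_given_y[label][x] * quantizer[x][z]
                                   for x in range(n_x))
                for z in range(n_z)]

    return measure(+1), measure(-1)


def f_divergence(f, mu, pi):
    """I_f(mu, pi) = sum_z pi(z) f(mu(z) / pi(z)), assuming pi(z) > 0."""
    return sum(p * f(m / p) for m, p in zip(mu, pi))


# Hypothetical toy problem: X takes 3 values, Z takes 2 values.
p_x_given_y = {+1: [0.6, 0.3, 0.1], -1: [0.1, 0.3, 0.6]}
quantizer = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # each row sums to 1
mu, pi = induced_measures(p_x_given_y, 0.5, quantizer)
variational = f_divergence(lambda u: abs(u - 1.0), mu, pi)
```

Because $\mu$ and $\pi$ are joint (not conditional) measures, their total masses sum to one rather than each summing to one.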

The loss functions that we consider are known as margin-based loss
functions. Specifically, we study convex loss functions
$\phi(y, \gamma(z))$ that are of the form $\phi(y\gamma(z))$, where the
product $y\gamma(z)$ is known as the margin. Note in particular that 0-1
loss can be written in this form, since
$\phi_{0\text{-}1}(y, \gamma(z)) = \mathbb{I}(y\gamma(z) < 0)$. Given such
a margin-based loss function, we define the $\phi$-risk
$R_\phi(\gamma, Q) = E\phi(Y\gamma(Z))$. Statistical procedures will be
defined in terms of minimizers of $R_\phi$ with respect to the arguments
$\gamma$ and $Q$, with the expectation replaced by an empirical
expectation defined by samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$.

With these definitions, we now summarize our main results, which are
stated technically in Theorems 1-3. The first result (Theorem 1) establishes
a general correspondence between the family of $f$-divergences and the
family of optimized $\phi$-risks. In particular, let $R_\phi(Q)$ denote the
optimal $\phi$-risk, meaning the $\phi$-risk obtained by optimizing over
the discriminant $\gamma$ as follows:

      $R_\phi(Q) := \inf_{\gamma \in \Gamma} R_\phi(\gamma, Q)$.

In Theorem 1, we establish a precise correspondence between these optimal
$\phi$-risks and the family of $f$-divergences. Theorem 1(a) addresses the
forward direction of this correspondence (from $\phi$ to $f$); in
particular, we show that any optimal $\phi$-risk can be written as
$R_\phi(Q) = -I_f(\mu, \pi)$, where $I_f$ is the divergence induced by a
suitably chosen convex function $f$. We also specify a set of properties
that any such function $f$ inherits from the surrogate loss $\phi$.
Theorem 1(b) addresses the converse question: given an $f$-divergence,
when can it be realized as an optimal $\phi$-risk? We provide a set of
necessary and sufficient conditions on any such $f$-divergence and,
moreover, specify a constructive procedure for determining all surrogate
loss functions $\phi$ that induce the specified $f$-divergence.

Fig. 1. Illustration of the correspondence between $f$-divergences and loss
functions (left: class of loss functions; right: class of $f$-divergences).
For each loss function $\phi$, there exists exactly one corresponding
$f$-divergence such that the optimized $\phi$-risk is equal to the negative
$f$-divergence. The reverse mapping is, in general, many-to-one.

The relationship is illustrated in Figure 1; whereas each surrogate loss
$\phi$ induces only one $f$-divergence, note that in general there are many
surrogate loss functions that correspond to the same $f$-divergence. As
particular examples of the general correspondence established in this
paper, we show that the hinge loss corresponds to the variational distance,
the exponential loss corresponds to the Hellinger distance, and the
logistic loss corresponds to the capacitory discrimination distance.

This correspondence, in addition to its intrinsic interest as an extension
of Blackwell's work, has a number of consequences. In Section 3, we show
that it allows us to isolate a class of $\phi$-losses for which empirical
risk minimization is consistent in the joint (quantizer and discriminant)
estimation setting. Note in particular (e.g., from Blackwell's work) that
the $f$-divergence associated with the 0-1 loss is the total variational
distance. In Theorem 2, we specify a broader class of $\phi$-losses that
induce the total variational distance and prove that, under standard
technical conditions, an empirical risk minimization procedure based on
any such $\phi$-risk is Bayes consistent. This broader class includes not
only the nonconvex 0-1 loss, but also other convex and computationally
tractable $\phi$-losses, including the hinge loss function that is well
known in the context of support vector machines [6]. The key novelty in
this result is that it applies to procedures that optimize simultaneously
over the discriminant function $\gamma$ and the quantizer $Q$.
One interpretation of Theorem 2 is as specifying a set of surrogate loss
functions $\phi$ that are universally equivalent to the 0-1 loss, in that
empirical risk minimization procedures based on such $\phi$ yield
classifier-quantizer pairs $(\gamma^*, Q^*)$ that achieve the Bayes risk.
In Section 4, we explore this notion of universal equivalence between loss
functions in more depth. In particular, we say that two loss functions
$\phi_1$ and $\phi_2$ are universally equivalent if the optimal risks
$R_{\phi_1}(Q)$ and $R_{\phi_2}(Q)$ induce the same ordering on quantizers,
meaning the ordering $R_{\phi_1}(Q_a) < R_{\phi_1}(Q_b)$ holds if and only
if $R_{\phi_2}(Q_a) < R_{\phi_2}(Q_b)$ for all quantizer pairs $Q_a$ and
$Q_b$. Thus, the set of surrogate loss functions can be categorized into
subclasses by this equivalence, where of particular interest are all
surrogate loss functions that are equivalent (in the sense just defined)
to the 0-1 loss. In Theorem 3, we provide an explicit and easily tested
set of conditions for a $\phi$-risk to be equivalent to the 0-1 loss.
One consequence is that procedures based on a $\phi$-risk outside of this
family cannot be Bayes consistent for joint optimization of the
discriminant $\gamma$ and quantizer $Q$. Thus, coupled with our earlier
result in Theorem 2, we obtain a set of necessary and sufficient
conditions on $\phi$-losses to be Bayes consistent in this joint
estimation setting.
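
The ordering-based notion of equivalence can be checked mechanically on a finite collection of quantizers. The sketch below is illustrative only: it uses the closed-form optimal risks derived in Section 2.1 for the 0-1, hinge and exponential losses, and the two pairs $(\mu, \pi)$ are hypothetical numbers chosen so that the exponential loss ranks the two quantizers differently from the 0-1 loss.

```python
import math


def bayes_risk(mu, pi):
    """Optimal 0-1 risk: sum_z min(mu(z), pi(z))."""
    return sum(min(m, p) for m, p in zip(mu, pi))


def hinge_risk(mu, pi):
    """Optimal hinge risk: sum_z 2 min(mu(z), pi(z))."""
    return 2.0 * bayes_risk(mu, pi)


def exp_risk(mu, pi):
    """Optimal exponential-loss risk: sum_z 2 sqrt(mu(z) pi(z))."""
    return sum(2.0 * math.sqrt(m * p) for m, p in zip(mu, pi))


def same_ordering(risks_1, risks_2):
    """True if R1(Qa) < R1(Qb) holds exactly when R2(Qa) < R2(Qb), for all pairs."""
    n = len(risks_1)
    return all((risks_1[a] < risks_1[b]) == (risks_2[a] < risks_2[b])
               for a in range(n) for b in range(n) if a != b)


# Hypothetical induced measures (mu, pi) for two quantizers Qa and Qb.
induced = [([0.4, 0.1], [0.1, 0.4]),          # Qa
           ([0.495, 0.005], [0.25, 0.25])]    # Qb

bayes = [bayes_risk(mu, pi) for mu, pi in induced]
hinge = [hinge_risk(mu, pi) for mu, pi in induced]
expo = [exp_risk(mu, pi) for mu, pi in induced]

hinge_agrees = same_ordering(bayes, hinge)
exp_agrees = same_ordering(bayes, expo)
```

On this toy pair the hinge loss orders the quantizers as the 0-1 loss does, while the exponential loss reverses the order. A finite check of this kind can only refute equivalence on the quantizers supplied; it cannot establish universal equivalence.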

2. Correspondence between $\phi$-loss and $f$-divergence. Recall that in
the setting of binary classification with $Q$ fixed, it is possible to
give conditions on the class of surrogate loss functions (i.e., upper
bounds on or approximations of the 0-1 loss) that yield Bayes consistency.
In particular, Bartlett, Jordan and McAuliffe [2] have provided the
following definition of a classification-calibrated loss.

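Definition 1. Define $\Phi_{a,b}(\alpha) = \phi(\alpha) a + \phi(-\alpha) b$.
A loss function $\phi$ is classification-calibrated if for any
$a, b \ge 0$ and $a \neq b$:

(3)  $\inf_{\{\alpha \in \mathbb{R} \,:\, \alpha(a - b) \le 0\}} \Phi_{a,b}(\alpha) > \inf_{\alpha \in \mathbb{R}} \Phi_{a,b}(\alpha)$.
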
The definition is essentially a pointwise form of a Fisher consistency
condition that is appropriate for the binary classification setting. When
$Q$ is fixed, this definition ensures that, under fairly general
conditions, the decision rule $\gamma$ obtained by an empirical risk
minimization procedure behaves equivalently to the Bayes optimal decision
rule. Bartlett, Jordan and McAuliffe [2] also derived a simple lemma that
characterizes classification-calibration for convex functions.

Lemma 1. Let $\phi$ be a convex function. Then $\phi$ is
classification-calibrated if and only if it is differentiable at 0 and
$\phi'(0) < 0$.




ON SURROGATE LOSS FUNCTIONS AND F-DIVERGENCES 






For our purposes, we will find it useful to consider a somewhat more
restricted definition of surrogate loss functions. In particular, we
impose the following three conditions on any surrogate loss function
$\phi : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$:

A1: $\phi$ is classification-calibrated;
A2: $\phi$ is continuous;
A3: Let $\alpha^* = \inf\{\alpha \in \mathbb{R} \cup \{+\infty\} \mid \phi(\alpha) = \inf \phi\}$.
    If $\alpha^* < +\infty$, then for any $\epsilon > 0$,

(4)  $\phi(\alpha^* - \epsilon) \ge \phi(\alpha^* + \epsilon)$.

The interpretation of assumption A3 is that one should penalize
deviations away from $\alpha^*$ in the negative direction at least as
strongly as deviations in the positive direction; this requirement is
intuitively reasonable given the margin-based interpretation of $\alpha$.
Moreover, this assumption is satisfied by all of the loss functions
commonly considered in the literature; in particular, any decreasing
function $\phi$ (e.g., hinge loss, logistic loss, exponential loss)
satisfies this condition, as does the least squares loss (which is not
decreasing). When $\phi$ is convex, assumption A1 is equivalent to
requiring that $\phi$ be differentiable at 0 and $\phi'(0) < 0$. These
facts also imply that the quantity $\alpha^*$ defined in assumption A3 is
strictly positive. Finally, although $\phi$ is not defined for $-\infty$,
we shall use the convention that $\phi(-\infty) = +\infty$.

In the following, we present the general relationship between optimal
$\phi$-risks and $f$-divergences. The easier direction is to show that any
$\phi$-risk induces a corresponding $f$-divergence. The $\phi$-risk can be
written in the following way:

(5a)  $R_\phi(\gamma, Q) = E\phi(Y\gamma(Z))$

(5b)  $\qquad\qquad\;\; = \sum_{z} \phi(\gamma(z))\mu(z) + \phi(-\gamma(z))\pi(z)$.

For a fixed mapping $Q$, the optimal $\phi$-risk has the form

      $R_\phi(Q) = \sum_{z \in \mathcal{Z}} \inf_{\alpha} \big(\phi(\alpha)\mu(z) + \phi(-\alpha)\pi(z)\big)
                 = \sum_{z} \pi(z) \inf_{\alpha} \Big(\phi(-\alpha) + \phi(\alpha)\frac{\mu(z)}{\pi(z)}\Big)$.

For each $z$, define $u(z) := \mu(z)/\pi(z)$. With this notation, the
function $\inf_\alpha(\phi(-\alpha) + \phi(\alpha)u)$ is concave as a
function of $u$ (since the pointwise infimum of a collection of linear
functions is concave). Thus, if we define

(6)  $f(u) := -\inf_{\alpha} \big(\phi(-\alpha) + \phi(\alpha)u\big)$,

we obtain the relation

(7)  $R_\phi(Q) = -I_f(\mu, \pi)$.



X. NGUYEN, M. J. WAINWRIGHT AND M. I. JORDAN 



We have thus established the easy direction of the correspondence: given a
loss function $\phi$, there exists an $f$-divergence for which the relation
(7) holds. Furthermore, the convex function $f$ is given by the expression
(6). Note that our argument does not require convexity of $\phi$.
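
Relations (6) and (7) can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not part of the paper: the infimum over $\alpha$ is replaced by a minimum over a finite grid (which happens to contain the hinge-loss minimizers $\pm 1$), and the measures $\mu$, $\pi$ are hypothetical.

```python
def optimal_phi_risk(phi, mu, pi, grid):
    """sum_z inf_alpha [phi(alpha) mu(z) + phi(-alpha) pi(z)], inf taken over grid."""
    return sum(min(phi(a) * m + phi(-a) * p for a in grid)
               for m, p in zip(mu, pi))


def induced_f(phi, grid):
    """f(u) = -inf_alpha [phi(-alpha) + phi(alpha) u], as in (6), on the same grid."""
    return lambda u: -min(phi(-a) + phi(a) * u for a in grid)


def f_divergence(f, mu, pi):
    """I_f(mu, pi) = sum_z pi(z) f(mu(z) / pi(z)), assuming pi(z) > 0."""
    return sum(p * f(m / p) for m, p in zip(mu, pi))


def hinge(a):
    return max(0.0, 1.0 - a)


grid = [k / 100.0 for k in range(-300, 301)]   # alpha in [-3, 3], step 0.01
mu, pi = [0.375, 0.125], [0.125, 0.375]        # hypothetical induced measures

risk = optimal_phi_risk(hinge, mu, pi, grid)
neg_divergence = -f_divergence(induced_f(hinge, grid), mu, pi)
```

Both quantities come out equal, as (7) predicts; for losses whose minimizer is not on the grid, the two sides still agree with each other but only approximate the true infimum.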

We now consider the converse. Given a divergence $I_f(\mu, \pi)$ for some
convex function $f$, does there exist a loss function $\phi$ for which
$R_\phi(Q) = -I_f(\mu, \pi)$? In the theorem presented below, we answer
this question in the affirmative. Moreover, we present a constructive
result: we specify necessary and sufficient conditions under which there
exist decreasing and convex surrogate loss functions for a given
$f$-divergence, and we specify the form of all such loss functions.

Recall the notion of convex duality [18]: For a lower semicontinuous convex
function $f : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$, the conjugate dual
$f^* : \mathbb{R} \to \mathbb{R} \cup \{\infty\}$ is defined as
$f^*(u) = \sup_{v \in \mathbb{R}} (uv - f(v))$. Consider an intermediate
function:

(8)  $\Psi(\beta) = f^*(-\beta)$.

Define $\beta_1 := \inf\{\beta : \Psi(\beta) < +\infty\}$ and
$\beta_2 := \inf\{\beta : \Psi(\beta) \le \inf \Psi\}$. We are
ready to state our first main result.

Theorem 1. (a) For any margin-based surrogate loss function $\phi$, there
is an $f$-divergence such that $R_\phi(Q) = -I_f(\mu, \pi)$ for some lower
semicontinuous convex function $f$.

In addition, if $\phi$ is a decreasing convex loss function that satisfies
conditions A1, A2 and A3, then the following properties hold:

(i) $\Psi$ is a decreasing and convex function;

(ii) $\Psi(\Psi(\beta)) = \beta$ for all $\beta \in (\beta_1, \beta_2)$;

(iii) there exists a point $u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$.

(b) Conversely, if $f$ is a lower semicontinuous convex function satisfying
all conditions (i)-(iii), there exists a decreasing convex surrogate loss
$\phi$ that induces the $f$-divergence in the sense of equations (6) and (7).

For proof of this theorem and additional properties, see Section 5.1.

Remarks. (a) The existential statement in Theorem 1 can be strengthened to
a constructive procedure, through which we specify how to obtain any
$\phi$ loss function that induces a given $f$-divergence. Indeed, in the
proof of Theorem 1(b) presented in Section 5.1, we prove that any
decreasing surrogate loss function $\phi$ satisfying conditions A1-A3 that
induces an $f$-divergence must be of the form

(9)  $\phi(\alpha) = \begin{cases} u^*, & \text{if } \alpha = 0, \\ \Psi(g(\alpha + u^*)), & \text{if } \alpha > 0, \\ g(-\alpha + u^*), & \text{if } \alpha < 0, \end{cases}$

where $g : [u^*, +\infty) \to \mathbb{R}$ is some increasing continuous and
convex function such that $g(u^*) = u^*$, and $g$ is right-differentiable
at $u^*$ with $g'(u^*) > 0$.

(b) Another consequence of Theorem 1 is that any $f$-divergence can be
obtained from a rather large set of surrogate loss functions; indeed,
different such losses are obtained by varying the function $g$ in our
constructive specification (9). In Section 2.1, we provide concrete
examples of this constructive procedure and the resulting correspondences.
For instance, we show that the variational distance corresponds to the 0-1
loss and the hinge loss, while the Hellinger distance corresponds to the
exponential loss. Both divergences are also obtained from many less
familiar loss functions.

(c) Although the correspondence has been formulated in the population
setting, it is the basis of a constructive method for specifying a class of
surrogate loss functions that yield a Bayes consistent estimation procedure.
Indeed, in Section 3, we exploit this result to isolate a subclass of
surrogate convex loss functions that yield Bayes-consistent procedures for
joint $(\gamma, Q)$ minimization. Interestingly, this class is a strict
subset of the class of classification-calibrated loss functions, all of
which yield Bayes-consistent estimation procedures in the standard
classification setting (e.g., [2]). For instance, the class that we isolate
contains the hinge loss, but not the exponential loss or the logistic loss
functions. Finally, in Section 4, we show that, in a suitable sense, the
specified subclass of surrogate loss functions is the only one that yields
consistency for the joint $(\gamma, Q)$ estimation problem.

2.1. Examples. In this section, we describe various correspondences
between $\phi$-losses and $f$-divergences that illustrate the claims of
Theorem 1.

2.1.1. 0-1 loss, hinge loss and variational distance. First, consider the
0-1 loss $\phi(\alpha) = \mathbb{I}[\alpha < 0]$. From equation (5b), the
optimal discriminant function $\gamma$ takes the form
$\gamma(z) = \operatorname{sign}(\mu(z) - \pi(z))$, so that the optimal
Bayes risk is given by

      $R_{\mathrm{Bayes}}(Q) = \sum_{z \in \mathcal{Z}} \min\{\mu(z), \pi(z)\}
                             = \frac{1}{2} - \frac{1}{2}\sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)|
                             = \frac{1}{2}\big(1 - V(\mu, \pi)\big)$,

where $V(\mu, \pi)$ denotes the variational distance
$V(\mu, \pi) := \sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)|$ between the two
measures $\mu$ and $\pi$.

Now, consider the hinge loss function
$\phi(\alpha) = \max\{0, 1 - \alpha\} = (1 - \alpha)_+$. In this case, a
similar calculation yields $\gamma(z) = \operatorname{sign}(\mu(z) - \pi(z))$
as the optimal discriminant. The optimal risk for hinge loss thus takes
the form:

      $R_{\mathrm{hinge}}(Q) = \sum_{z \in \mathcal{Z}} 2\min\{\mu(z), \pi(z)\}
                             = 1 - \sum_{z \in \mathcal{Z}} |\mu(z) - \pi(z)|
                             = 1 - V(\mu, \pi)$.
Thus, both the 0-1 loss and the hinge loss give rise to $f$-divergences of
the form $f(u) = -c\min\{u, 1\} + au + b$ for some constants $c > 0$ and
$a, b$. Conversely, consider an $f$-divergence that is based on the
function $f(u) = -2\min(u, 1)$ for $u \ge 0$. Augmenting the definition by
setting $f(u) = +\infty$ for $u < 0$, we use equation (8) to calculate
$\Psi$:

      $\Psi(\beta) = f^*(-\beta) = \sup_{u \in \mathbb{R}} (-\beta u - f(u))
                   = \begin{cases} 0, & \text{if } \beta > 2, \\ 2 - \beta, & \text{if } 0 \le \beta \le 2, \\ +\infty, & \text{if } \beta < 0. \end{cases}$

By inspection, we see that $u^* = 1$, where $u^*$ was defined in part (iii)
of Theorem 1(a). If we set $g(u) = u$, then we recover the hinge loss
$\phi(\alpha) = (1 - \alpha)_+$. On the other hand, choosing
$g(u) = e^{u-1}$ leads to the loss

(10)  $\phi(\alpha) = \begin{cases} (2 - e^{\alpha})_+, & \text{for } \alpha \le 0, \\ e^{-\alpha}, & \text{for } \alpha > 0. \end{cases}$

Note that the loss function obtained with this particular choice of $g$ is
not convex, but our theory nonetheless guarantees that this nonconvex loss
still induces $f$ in the sense of equation (7). To ensure that $\phi$ is
convex, we must choose $g$ to be an increasing convex function in
$[1, +\infty)$ such that $g(u) = u$ for $u \in [1, 2]$. See Figure 2 for
illustrations of some convex $\phi$ losses.
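
The constructive form (9) translates directly into code. The following sketch is illustrative: the helper name `loss_from_g` is hypothetical, and $\Psi$ is hard-coded to the closed form computed above for $f(u) = -2\min(u, 1)$. With $g(u) = u$ and $u^* = 1$ it reproduces the hinge loss at a handful of test points.

```python
def psi_variational(beta):
    """Psi(beta) = f*(-beta) for f(u) = -2 min(u, 1) on u >= 0, +inf for u < 0."""
    if beta < 0:
        return float("inf")
    return max(0.0, 2.0 - beta)


def loss_from_g(psi, g, u_star):
    """Build a loss of the form (9) from Psi, a function g, and the fixed point u*."""
    def phi(alpha):
        if alpha == 0:
            return u_star
        if alpha > 0:
            return psi(g(alpha + u_star))
        return g(-alpha + u_star)
    return phi


def hinge(alpha):
    return max(0.0, 1.0 - alpha)


# g(u) = u and u* = 1, the choice described in the text.
phi = loss_from_g(psi_variational, lambda u: u, 1.0)
points = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
max_gap = max(abs(phi(a) - hinge(a)) for a in points)
```

Other admissible choices of $g$ can be passed to the same helper to explore different losses that induce the same divergence.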

2.1.2. Exponential loss and Hellinger distance. Now, consider the
exponential loss $\phi(\alpha) = \exp(-\alpha)$. In this case, a little
calculation shows that the optimal discriminant is
$\gamma(z) = \frac{1}{2}\log\frac{\mu(z)}{\pi(z)}$. The optimal risk for
exponential loss is given by

      $R_{\exp}(Q) = \sum_{z \in \mathcal{Z}} 2\sqrt{\mu(z)\pi(z)}
                   = 1 - \sum_{z \in \mathcal{Z}} \big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2
                   = 1 - 2h^2(\mu, \pi)$,

where $h^2(\mu, \pi) := \frac{1}{2}\sum_{z \in \mathcal{Z}} \big(\sqrt{\mu(z)} - \sqrt{\pi(z)}\big)^2$
and $h$ denotes the Hellinger distance between measures $\mu$ and $\pi$.
Conversely, the Hellinger distance is equivalent to the negative of the
Bhattacharyya distance, which is an $f$-divergence with
$f(u) = -2\sqrt{u}$ for $u \ge 0$. Let us augment the definition of $f$ by
setting $f(u) = +\infty$ for $u < 0$; doing so does not alter the
Hellinger (or Bhattacharyya) distances. As before,

      $\Psi(\beta) = f^*(-\beta) = \sup_{u \in \mathbb{R}} (-\beta u - f(u))
                   = \begin{cases} 1/\beta, & \text{when } \beta > 0, \\ +\infty, & \text{otherwise.} \end{cases}$

Thus, we see that $u^* = 1$. If we let $g(u) = u$, then a possible
surrogate loss function that realizes the Hellinger distance takes the
form:

      $\phi(\alpha) = \begin{cases} 1, & \text{if } \alpha = 0, \\ \dfrac{1}{\alpha + 1}, & \text{if } \alpha > 0, \\ -\alpha + 1, & \text{if } \alpha < 0. \end{cases}$



Fig. 2. Panels (a) and (b) show examples of $\phi$ losses that induce the
Hellinger distance and variational distance, respectively, based on
different choices of the function $g$. Panel (c) shows a loss function that
induces the symmetric KL divergence; for the purposes of comparison, the
0-1 loss is also plotted. (Horizontal axis in each panel: margin value
$\alpha$.)



On the other hand, if we set $g(u) = \exp(u - 1)$, then we obtain the
exponential loss $\phi(\alpha) = \exp(-\alpha)$. See Figure 2 for
illustrations of these loss functions.
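
The closed form for the optimal exponential-loss risk given in this subsection can be sanity-checked numerically. The sketch below is illustrative; the measures $\mu$ and $\pi$ are hypothetical, and `hellinger_sq` is computed as one half of the sum of squared differences of square roots, which is the normalization that makes the risk equal $1 - 2h^2$.

```python
import math


def exp_risk_at(gamma, mu, pi):
    """phi-risk (5b) for phi(a) = exp(-a) at a given discriminant gamma(z)."""
    return sum(math.exp(-g) * m + math.exp(g) * p
               for g, m, p in zip(gamma, mu, pi))


mu, pi = [0.375, 0.125], [0.125, 0.375]   # hypothetical induced measures

# Optimal discriminant gamma(z) = (1/2) log(mu(z) / pi(z)).
gamma_opt = [0.5 * math.log(m / p) for m, p in zip(mu, pi)]
risk_opt = exp_risk_at(gamma_opt, mu, pi)

closed_form = sum(2.0 * math.sqrt(m * p) for m, p in zip(mu, pi))
hellinger_sq = 0.5 * sum((math.sqrt(m) - math.sqrt(p)) ** 2
                         for m, p in zip(mu, pi))

# Moving away from gamma_opt should increase the risk.
risk_perturbed = exp_risk_at([g + 0.1 for g in gamma_opt], mu, pi)
```

Here `risk_opt`, `closed_form` and `1 - 2 * hellinger_sq` coincide up to rounding, and the perturbed discriminant has strictly larger risk.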

2.1.3. Least squares loss and triangular discrimination distance. Letting
$\phi(\alpha) = (1 - \alpha)^2$ be the least squares loss, the optimal
discriminant is given by
$\gamma(z) = \frac{\mu(z) - \pi(z)}{\mu(z) + \pi(z)}$. Thus, the optimal
risk for least squares loss takes the form

      $R_{\mathrm{sqr}}(Q) = \sum_{z \in \mathcal{Z}} \frac{4\mu(z)\pi(z)}{\mu(z) + \pi(z)}
                           = 1 - \sum_{z \in \mathcal{Z}} \frac{(\mu(z) - \pi(z))^2}{\mu(z) + \pi(z)}
                           = 1 - \Delta(\mu, \pi)$,

where $\Delta(\mu, \pi)$ denotes the triangular discrimination distance
[20]. Conversely, the triangular discrimination distance is equivalent to
the negative of the harmonic distance; it is an $f$-divergence with
$f(u) = -\frac{4u}{u + 1}$ for $u \ge 0$.

Let us augment $f$ with $f(u) = +\infty$ for $u < 0$. We have

      $\Psi(\beta) = \sup_{u \in \mathbb{R}} (-\beta u - f(u))
                   = \begin{cases} (2 - \sqrt{\beta})^2, & \text{for } \beta \ge 0, \\ +\infty, & \text{otherwise.} \end{cases}$

Clearly, $u^* = 1$. In this case, setting $g(u) = u^2$ gives the least
squares loss $\phi(\alpha) = (1 - \alpha)^2$.
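
The conjugate-dual computation behind $\Psi$ can be approximated by a brute-force supremum over a grid. The sketch below is an illustration under stated assumptions: it takes $f(u) = -4u/(u+1)$ on $u \ge 0$ (the form implied by (6) for the least squares loss), truncates the supremum to a finite grid on $[0, 20]$, and compares against $(2 - \sqrt{\beta})^2$ only for a few values of $\beta$ in $[0, 4]$, where the maximizer $u = 2/\sqrt{\beta} - 1$ is nonnegative and lies inside the grid.

```python
import math


def conjugate_at(f, beta, u_grid):
    """Grid approximation of Psi(beta) = sup_{u >= 0} (-beta * u - f(u))."""
    return max(-beta * u - f(u) for u in u_grid)


def f_least_squares(u):
    """Assumed f for the least squares loss: f(u) = -4u / (u + 1), u >= 0."""
    return -4.0 * u / (u + 1.0)


u_grid = [k / 1000.0 for k in range(0, 20001)]   # u in [0, 20], step 0.001
betas = [0.25, 1.0, 2.0, 4.0]

numeric = [conjugate_at(f_least_squares, b, u_grid) for b in betas]
closed = [(2.0 - math.sqrt(b)) ** 2 for b in betas]
max_gap = max(abs(x - y) for x, y in zip(numeric, closed))
```

The same helper can be pointed at any other convex $f$ from this section, provided the grid is wide and fine enough to contain the maximizer.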



2.1.4. Logistic loss and capacitory discrimination distance. Let
$\phi(\alpha) = \log(1 + \exp(-\alpha))$ be the logistic loss. Then,
$\gamma(z) = \log\frac{\mu(z)}{\pi(z)}$. As a result, the optimal risk for
logistic loss is given by

      $R_{\log}(Q) = \sum_{z \in \mathcal{Z}} \mu(z)\log\frac{\mu(z) + \pi(z)}{\mu(z)} + \pi(z)\log\frac{\mu(z) + \pi(z)}{\pi(z)}
                   = \log 2 - \mathrm{KL}\Big(\mu \,\Big\|\, \frac{\mu + \pi}{2}\Big) - \mathrm{KL}\Big(\pi \,\Big\|\, \frac{\mu + \pi}{2}\Big)
                   = \log 2 - C(\mu, \pi)$,

where $\mathrm{KL}(U, V)$ denotes the Kullback-Leibler divergence between
two measures $U$ and $V$, and $C(U, V)$ denotes the capacitory
discrimination distance [20]. Conversely, the capacitory discrimination
distance is equivalent to an $f$-divergence with
$f(u) = -u\log\frac{u + 1}{u} - \log(u + 1)$, for $u \ge 0$. As before,
augmenting this function with $f(u) = +\infty$ for $u < 0$, we have

      $\Psi(\beta) = \sup_{u \in \mathbb{R}} (-\beta u - f(u))
                   = \begin{cases} \beta - \log(e^{\beta} - 1), & \text{for } \beta > 0, \\ +\infty, & \text{otherwise.} \end{cases}$

This representation shows that $u^* = \log 2$. If we choose
$g(u) = \log(1 + e^{u}/2)$, then we recover the logistic loss
$\phi(\alpha) = \log[1 + \exp(-\alpha)]$.
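
The identity $R_{\log}(Q) = \log 2 - C(\mu, \pi)$ can likewise be checked on a toy example. In this sketch the measures are hypothetical, and `kl` is applied to unnormalized measures exactly as in the displayed formula (the masses of $\mu$ and $\pi$ sum to one jointly, not individually).

```python
import math


def logistic_risk_at(gamma, mu, pi):
    """phi-risk (5b) for the logistic loss phi(a) = log(1 + exp(-a))."""
    return sum(m * math.log1p(math.exp(-g)) + p * math.log1p(math.exp(g))
               for g, m, p in zip(gamma, mu, pi))


def kl(u, v):
    """sum_z u(z) log(u(z) / v(z)) for strictly positive measures u, v."""
    return sum(a * math.log(a / b) for a, b in zip(u, v))


mu, pi = [0.375, 0.125], [0.125, 0.375]   # hypothetical induced measures

# Optimal discriminant gamma(z) = log(mu(z) / pi(z)).
gamma_opt = [math.log(m / p) for m, p in zip(mu, pi)]
risk_opt = logistic_risk_at(gamma_opt, mu, pi)

midpoint = [(m + p) / 2.0 for m, p in zip(mu, pi)]
capacitory = kl(mu, midpoint) + kl(pi, midpoint)
gap = abs(risk_opt - (math.log(2.0) - capacitory))
```

The gap is zero up to floating-point rounding on this example.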



2.1.5. Another symmetrized Kullback-Leibler divergence. Recall that both
of the KL divergences [i.e., $\mathrm{KL}(\mu\|\pi)$ and
$\mathrm{KL}(\pi\|\mu)$] are asymmetric; therefore, Corollary 3 (see
Section 5.1) implies that they are not realizable by any margin-based
surrogate loss. However, a closely related functional is the symmetric
Kullback-Leibler divergence [5]:

(11)  $\mathrm{KL}_s(\mu, \pi) := \mathrm{KL}(\mu\|\pi) + \mathrm{KL}(\pi\|\mu)$.

It can be verified that this symmetrized KL divergence is an
$f$-divergence, generated by the function $f(u) = -\log u + u\log u$ for
$u > 0$, and $+\infty$ otherwise. Theorem 1 implies that it can be
generated by surrogate loss functions of form (9), but the form of this
loss function is not at all obvious. Therefore, in order to recover an
explicit form for some $\phi$, we follow the constructive procedure
outlined in the remarks following Theorem 1, first defining

      $\Psi(\beta) = \sup_{u > 0} \{-\beta u + \log u - u\log u\}$.

In order to compute the value of this supremum, we take the derivative with
respect to $u$ and set it to zero; doing so yields the zero-gradient
condition $-\beta + 1/u - \log u - 1 = 0$. To capture this condition, we
define a function $r : [0, +\infty) \to [-\infty, +\infty]$ via
$r(u) = 1/u - \log u$. It is easy to see that $r$ is a strictly decreasing
function whose range covers the whole real line; moreover, the
zero-gradient condition is equivalent to $r(u) = \beta + 1$. We can thus
write $\Psi(\beta) = u + \log u - 1$ where $u = r^{-1}(\beta + 1)$, or,
equivalently,

      $\Psi(\beta) = r\!\left(\frac{1}{r^{-1}(\beta + 1)}\right) - 1$.

It is straightforward to verify that the function $\Psi$ thus specified is
strictly decreasing and convex with $\Psi(0) = 0$, and that
$\Psi(\Psi(\beta)) = \beta$ for any $\beta \in \mathbb{R}$. Therefore,
Theorem 1 allows us to specify the form of any convex surrogate loss
function that generates the symmetric KL divergence; in particular, any
such function must be of the form (9):

      $\phi(\alpha) = \begin{cases} g(-\alpha), & \text{for } \alpha \le 0, \\ \Psi(g(\alpha)), & \text{otherwise,} \end{cases}$

where $g : [0, +\infty) \to [0, +\infty)$ is some increasing convex
function satisfying $g(0) = 0$. As a particular example (and one that leads
to a closed form expression for $\phi$), let us choose
$g(u) = e^{u} + u - 1$. Doing so leads to the surrogate loss function

      $\phi(\alpha) = e^{-\alpha} - \alpha - 1$,

as illustrated in Figure 2(c).

3. Bayes consistency via surrogate losses. As shown in Section 2.1.1, if
we substitute the (nonconvex) 0-1 loss function into the linking equation
(6), then we obtain the variational distance $V(\mu, \pi)$ as the
$f$-divergence associated with the function $f(u) = -\min\{u, 1\}$. A bit
more broadly, let us consider the subclass of $f$-divergences defined by
functions of the form

(12)  $f(u) = -c\min\{u, 1\} + au + b$,

where $a$, $b$ and $c$ are scalars with $c > 0$. (For further examples of
such losses, in addition to the 0-1 loss, see Section 2.1.) The main result
of this section is that there exists a subset of surrogate losses $\phi$
associated with an $f$-divergence of the form (12) that, when used in the
context of a risk minimization procedure for jointly optimizing
$(\gamma, Q)$ pairs, yields a Bayes consistent method.

We begin by specifying some standard technical conditions under which
our Bayes consistency result holds. Consider sequences of increasing
compact function classes
$\mathcal{C}_1 \subseteq \mathcal{C}_2 \subseteq \cdots \subseteq \Gamma$
and $\mathcal{D}_1 \subseteq \mathcal{D}_2 \subseteq \cdots \subseteq \mathcal{Q}$.
Recall that $\Gamma$ denotes the class of all measurable functions from
$\mathcal{Z} \to \mathbb{R}$, whereas $\mathcal{Q}$ is a constrained class
of quantizer functions $Q$, with the restriction that $\mu$ and $\pi$ are
strictly positive measures. Our analysis supposes that there exists an
oracle that outputs an optimal solution to the minimization problem

(13)  $\min_{(\gamma, Q) \in (\mathcal{C}_n, \mathcal{D}_n)} \hat{R}_\phi(\gamma, Q)
       = \min_{(\gamma, Q) \in (\mathcal{C}_n, \mathcal{D}_n)} \frac{1}{n}\sum_{i=1}^{n}\sum_{z \in \mathcal{Z}} \phi(Y_i\gamma(z))\,Q(z|X_i)$,

and we let $(\gamma_n^*, Q_n^*)$ denote one such solution. Let
$R^*_{\mathrm{Bayes}}$ denote the minimum Bayes risk achieved over the
space of decision rules $(\gamma, Q) \in (\Gamma, \mathcal{Q})$:

(14)  $R^*_{\mathrm{Bayes}} := \inf_{(\gamma, Q) \in (\Gamma, \mathcal{Q})} R_{\mathrm{Bayes}}(\gamma, Q)$.

We refer to the nonnegative quantity
$R_{\mathrm{Bayes}}(\gamma_n^*, Q_n^*) - R^*_{\mathrm{Bayes}}$ as the excess
Bayes risk of our estimation procedure. We say that such an estimation
procedure is universally consistent if the excess Bayes risk converges to
zero, that is, if under the (unknown) Borel probability measure $P$ on
$\mathcal{X} \times \mathcal{Y}$, we have

(15)  $\lim_{n \to \infty} R_{\mathrm{Bayes}}(\gamma_n^*, Q_n^*) = R^*_{\mathrm{Bayes}}$  in probability.
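
For a fixed pair $(\gamma, Q)$, the empirical objective in (13) is straightforward to evaluate. The following is a minimal sketch assuming a finite space $\mathcal{Z} = \{0, 1\}$; the threshold quantizer, the discriminant values and the five samples are hypothetical, and the oracle that minimizes over $(\mathcal{C}_n, \mathcal{D}_n)$ is not implemented here.

```python
def empirical_phi_risk(phi, gamma, quantizer, samples):
    """(1/n) sum_i sum_z phi(y_i * gamma[z]) * Q(z | x_i), as in (13).

    gamma[z] is the discriminant value at z, quantizer(x) returns the list
    [Q(z | x) for z in Z], and samples is a list of (x, y) with y in {-1, +1}.
    """
    total = 0.0
    for x, y in samples:
        q = quantizer(x)
        total += sum(phi(y * gamma[z]) * q[z] for z in range(len(gamma)))
    return total / len(samples)


def hinge(a):
    return max(0.0, 1.0 - a)


def threshold_quantizer(x):
    """Deterministic quantizer: z = 0 if x < 0, else z = 1."""
    return [1.0, 0.0] if x < 0.0 else [0.0, 1.0]


samples = [(-1.2, -1), (-0.3, -1), (0.4, +1), (0.9, +1), (-0.1, +1)]
gamma = [-1.0, 1.0]   # predict -1 on z = 0 and +1 on z = 1
risk = empirical_phi_risk(hinge, gamma, threshold_quantizer, samples)
```

Only the last sample is misclassified, contributing a hinge loss of 2, so the empirical risk is 2/5. A stochastic quantizer is handled by returning non-degenerate probabilities from `quantizer(x)`.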

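In order to analyze the statistical behavior of this algorithm and to
establish universal consistency for appropriate sequences
$(\mathcal{C}_n, \mathcal{D}_n)$ of function classes, we follow a standard
strategy of decomposing the Bayes error in terms of two types of errors:

• the approximation error associated with function classes
  $\mathcal{C}_n \subseteq \Gamma$ and $\mathcal{D}_n \subseteq \mathcal{Q}$:

(16)  $\mathcal{E}_0(\mathcal{C}_n, \mathcal{D}_n) = \inf_{(\gamma, Q) \in (\mathcal{C}_n, \mathcal{D}_n)} \{R_\phi(\gamma, Q)\} - R^*_\phi$,

  where $R^*_\phi := \inf_{(\gamma, Q) \in (\Gamma, \mathcal{Q})} R_\phi(\gamma, Q)$;

• the estimation error introduced by the finite sample size $n$:

(17)  $\mathcal{E}_1(\mathcal{C}_n, \mathcal{D}_n) = E \sup_{(\gamma, Q) \in (\mathcal{C}_n, \mathcal{D}_n)} \big|\hat{R}_\phi(\gamma, Q) - R_\phi(\gamma, Q)\big|$,

  where the expectation is taken with respect to the (unknown) measure
  $P^n(X, Y)$.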
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For asserting universal consistency, we impose the standard conditions: 

(18) Approximation condition: lim £o(C n , T> n ) = 0. 

(19) Estimation condition: lim £i(C n ,T> n ) = in probability. 

n-^oo 

Conditions on loss function </>: Our consistency result applies to the class 
of surrogate losses that satisfy the following: 

Bl: (f> is continuous, convex, and classification-calibrated; 
B2: For each n = 1, 2, . . . , we assume that 

(20) M n := max sup sup \(j)(yj(z))\ < +oo. 

?/6{-i>+i}( 7i Q) e (C n ,x>„) 2 e2 

With this set-up, the following theorem ties together the Bayes error 
with the approximation error and estimation error and provides sufficient 
conditions for universal consistency for a suitable subclass of surrogate loss 
functions. 



Theorem 2. Consider an estimation procedure of the form (13), using
a surrogate loss $\phi$. Recall the prior probabilities $p = \mathbb{P}(Y = 1)$ and
$q = \mathbb{P}(Y = -1)$. For any surrogate loss $\phi$ satisfying conditions B1 and B2 and inducing
an $f$-divergence of the form (12) for any $c > 0$, and for $a, b$ such that
$(a - b)(p - q) \ge 0$, we have:

(a) For any Borel probability measure $\mathbb{P}$, there holds, with probability at
least $1 - \delta$:

$$R_{\mathrm{Bayes}}(\gamma_n^*, Q_n^*) - R^*_{\mathrm{Bayes}} \le \frac{2}{c} \left\{ 2\mathcal{E}_1(\mathcal{C}_n, \mathcal{D}_n) + \mathcal{E}_0(\mathcal{C}_n, \mathcal{D}_n) + 2 M_n \sqrt{2\,\frac{\ln(2/\delta)}{n}} \right\}.$$

(b) Universal Consistency: For function classes satisfying the approximation (18)
and estimation conditions (19), the estimation procedure (13)
is universally consistent:

$$\lim_{n \to \infty} R_{\mathrm{Bayes}}(\gamma_n^*, Q_n^*) = R^*_{\mathrm{Bayes}} \quad \text{in probability.} \tag{21}$$

Remarks. (i) Note that both the approximation and the estimation
errors are with respect to the $\phi$-loss, but the theorem statement refers to
the excess Bayes risk. Since the analysis of approximation and estimation
conditions such as those in equations (18) and (19) is a standard topic in
statistical learning, we will not discuss it further here. We note that our
previous work analyzed the estimation error for certain kernel classes [15].

(ii) It is worth pointing out that in order for our result to be applicable
to an arbitrary constrained class $\mathcal{Q}$ for which $\mu$ and $\pi$ are strictly
positive measures, we need the additional constraint that $(a - b)(p - q) \ge 0$,
where $a, b$ are scalars in the $f$-divergence (12) and $p, q$ are the unknown prior
probabilities. Intuitively, this requirement is needed to ensure that the
approximation error due to varying $Q$ within $\mathcal{Q}$ dominates the approximation
error due to varying $\gamma$ (because the optimal $\gamma$ is determined only after $Q$)
for arbitrary $\mathcal{Q}$. Since $p$ and $q$ are generally unknown, the only $f$-divergences
that are practically useful are the ones for which $a = b$. One such $\phi$ is the
hinge loss, which underlies the support vector machine.
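As a rough numerical illustration of the hinge-loss case, the sketch below approximates $f(u) = -\inf_{\alpha}(\phi(-\alpha) + \phi(\alpha)u)$ on a grid of $\alpha$ values. For the hinge loss the grid values agree with $-2\min(u, 1)$, that is, $c = 2$ and $a = b = 0$ if form (12) is read as $f(u) = -c\min(u,1) + au + b$. The grid range, the step count, and the helper name `induced_f` are assumptions made only for this example.

```python
def hinge(a):
    """Hinge loss phi(a) = max(0, 1 - a)."""
    return max(0.0, 1.0 - a)


def induced_f(phi, u, lo=-5.0, hi=5.0, steps=10000):
    """Approximate f(u) = -inf_alpha [phi(-alpha) + phi(alpha) * u] on a uniform grid."""
    best = min(
        phi(-alpha) + phi(alpha) * u
        for alpha in (lo + (hi - lo) * k / steps for k in range(steps + 1))
    )
    return -best


# Compare against -2 * min(u, 1) at a few arbitrary test points.
values = {u: induced_f(hinge, u) for u in (0.25, 0.5, 1.0, 2.0, 4.0)}
max_dev = max(abs(v + 2.0 * min(u, 1.0)) for u, v in values.items())
```

The grid search is only a sanity check; for the hinge loss the infimum is attained at $\alpha = \pm 1$, which the grid contains up to rounding.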

Finally, we note that the proof of Theorem 2 relies on an auxiliary result
that is of independent interest. In particular, we prove that for any function
classes $\mathcal{C}$ and $\mathcal{D}$, for a certain choice of surrogate loss $\phi$, the excess
$\phi$-risk is related to the excess Bayes risk as follows.

Lemma 2. Let $\phi$ be a surrogate loss function satisfying all conditions
specified in Theorem 2. Then, for any classifier-quantizer pair
$(\gamma, Q) \in (\mathcal{C}, \mathcal{D})$, we have

$$\frac{c}{2}\,[R_{\mathrm{Bayes}}(\gamma, Q) - R^*_{\mathrm{Bayes}}] \le R_\phi(\gamma, Q) - R_\phi^*. \tag{22}$$

This result (22) demonstrates that in order to achieve joint Bayes consistency
(that is, in order to drive the excess Bayes risk to zero, while optimizing over
the pair $(\gamma, Q)$), it suffices to drive the excess $\phi$-risk to zero.
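The inequality can be sanity-checked numerically in a simplified setting. The sketch below assumes the hinge loss (for which $c = 2$ and $a = b = 0$) and a single fixed quantizer, so that $R^*_{\mathrm{Bayes}}$ and $R_\phi^*$ reduce to $\sum_z \min\{\mu(z), \pi(z)\}$ and $2\sum_z \min\{\mu(z), \pi(z)\}$. The random sampling scheme and all names are illustrative assumptions, and the check is not a substitute for the proof.

```python
import random


def hinge(a):
    """Hinge loss phi(a) = max(0, 1 - a)."""
    return max(0.0, 1.0 - a)


def excess_risks(mu, pi, gamma):
    """Return (excess 0-1 risk, excess hinge risk) of gamma for a fixed quantizer.

    mu[z] = P(Y = +1, Z = z) and pi[z] = P(Y = -1, Z = z).
    """
    bayes_opt = sum(min(m, p) for m, p in zip(mu, pi))
    # Error probability: predict +1 where gamma > 0, and -1 where gamma < 0.
    err = sum(p * (g > 0) + m * (g < 0) for m, p, g in zip(mu, pi, gamma))
    phi_risk = sum(m * hinge(g) + p * hinge(-g) for m, p, g in zip(mu, pi, gamma))
    return err - bayes_opt, phi_risk - 2.0 * bayes_opt


random.seed(0)
worst_gap = float("inf")
for _ in range(2000):
    w = [random.random() for _ in range(6)]
    s = sum(w)
    mu, pi = [x / s for x in w[:3]], [x / s for x in w[3:]]
    gamma = [random.uniform(-3.0, 3.0) for _ in range(3)]
    ex01, exphi = excess_risks(mu, pi, gamma)
    # With c = 2, (22) for a fixed quantizer reads ex01 <= exphi.
    worst_gap = min(worst_gap, exphi - ex01)
```

A nonnegative `worst_gap` over the sampled instances is consistent with (22) in this restricted case.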

4. Comparison between loss functions. We have studied a broad class of
loss functions corresponding to $f$-divergences of the form (12) in Theorem 1.
A subset of this class in turn yields Bayes consistency for the estimation
procedure (13) as shown in Theorem 2. A natural question is, are there any
other surrogate loss functions that also yield Bayes consistency?

A necessary condition for achieving Bayes consistency using estimation
procedure (13) is that the constrained minimization over surrogate $\phi$-risks
should yield a $(Q, \gamma)$ pair that minimizes the expected 0-1 loss subject to the
same constraints. In this section, we show that only surrogate loss functions
that induce $f$-divergences of the form (12) can actually satisfy this property.
We establish this result by developing a general way of comparing different
loss functions. In particular, by exploiting the correspondence between
surrogate losses and $f$-divergences, we are able to compare surrogate losses in
terms of their corresponding $f$-divergences.

4.1. Connection between 0-1 loss and f-divergences. The connection between
$f$-divergences and 0-1 loss that we develop has its origins in seminal
work on comparison of experiments by Blackwell and others [3, 4, 5]. In
particular, we give the following definition.

Definition 2. The quantizer $Q_1$ dominates $Q_2$ if
$R_{\mathrm{Bayes}}(Q_1) \le R_{\mathrm{Bayes}}(Q_2)$
for any choice of prior probability $q = \mathbb{P}(Y = -1) \in (0, 1)$.

Recall that a choice of quantizer design $Q$ induces two conditional
distributions, say $P(Z \mid Y = 1) \sim P_1$ and $P(Z \mid Y = -1) \sim P_{-1}$. From here onward,
we use $P_{-1}^{Q}$ and $P_1^{Q}$ to denote the fact that both $P_{-1}$ and $P_1$ are determined
by the specific choice of $Q$. By "parameterizing" the decision-theoretic
criterion in terms of loss function $\phi$ and establishing a precise correspondence
between $\phi$ and the $f$-divergence, we obtain an arguably simpler proof of the
classical theorem [3, 4] that relates 0-1 loss to $f$-divergences.

Proposition 1 [3, 4]. For any two quantizer designs $Q_1$ and $Q_2$, the
following statements are equivalent:

(a) $Q_1$ dominates $Q_2$ [i.e., $R_{\mathrm{Bayes}}(Q_1) \le R_{\mathrm{Bayes}}(Q_2)$ for any prior
probability $q \in (0, 1)$];

(b) $I_f(P_1^{Q_1}, P_{-1}^{Q_1}) \ge I_f(P_1^{Q_2}, P_{-1}^{Q_2})$, for all functions $f$ of the form
$f(u) = -\min(u, c)$ for some $c > 0$;

(c) $I_f(P_1^{Q_1}, P_{-1}^{Q_1}) \ge I_f(P_1^{Q_2}, P_{-1}^{Q_2})$, for all convex functions $f$.

Proof. We first establish the equivalence (a) $\Leftrightarrow$ (b). By the
correspondence between 0-1 loss and an $f$-divergence with $f(u) = -\min(u, 1)$,
we have $R_{\mathrm{Bayes}}(Q) = -I_f(\mu, \pi) = -I_{f_q}(P_1, P_{-1})$, where

$$f_q(u) := q f\!\left(\frac{1-q}{q}\,u\right) = -(1 - q) \min\!\left(u, \frac{q}{1-q}\right).$$

Hence, (a) $\Leftrightarrow$ (b).

Next, we prove the equivalence (b) $\Leftrightarrow$ (c). The implication (c) $\Rightarrow$ (b) is
immediate. Considering the reverse implication (b) $\Rightarrow$ (c), we note that any
convex function $f(u)$ can be uniformly approximated over a bounded interval
as a sum of a linear function and $-\sum_k \alpha_k \min(u, c_k)$, where $\alpha_k > 0$, $c_k > 0$
for all $k$. For a linear function $f$, $I_f(P_{-1}, P_1)$ does not depend on $P_{-1}, P_1$.
Using these facts, (c) follows easily from (b). □
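The identity used in the first step of the proof can be checked numerically on a small example. The sketch below assumes the convention $I_f(\mu, \pi) = \sum_z \pi(z) f(\mu(z)/\pi(z))$, with $\mu(z) = (1-q)P_1(z)$ and $\pi(z) = qP_{-1}(z)$; the distributions, the prior, and the function names are made up for illustration.

```python
def f_div(f, mu, pi):
    """I_f(mu, pi) = sum_z pi(z) * f(mu(z) / pi(z)), for strictly positive pi."""
    return sum(p * f(m / p) for m, p in zip(mu, pi))


def f01(u):
    """f(u) = -min(u, 1), the function associated with the 0-1 loss."""
    return -min(u, 1.0)


q = 0.3                        # prior P(Y = -1), arbitrary
P1 = [0.5, 0.3, 0.2]           # P(Z | Y = +1), made up
Pm1 = [0.1, 0.3, 0.6]          # P(Z | Y = -1), made up
mu = [(1 - q) * x for x in P1]
pi = [q * x for x in Pm1]


def f_q(u):
    """f_q(u) = q * f((1 - q) * u / q) = -(1 - q) * min(u, q / (1 - q))."""
    return -(1 - q) * min(u, q / (1 - q))


bayes_risk = sum(min(m, p) for m, p in zip(mu, pi))
via_joint = -f_div(f01, mu, pi)         # -I_f(mu, pi)
via_conditionals = -f_div(f_q, P1, Pm1)  # -I_{f_q}(P_1, P_{-1})
```

All three quantities coincide up to floating-point rounding, which is the content of the chain $R_{\mathrm{Bayes}}(Q) = -I_f(\mu, \pi) = -I_{f_q}(P_1, P_{-1})$.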

Corollary 1. The quantizer $Q_1$ dominates $Q_2$ if and only if
$R_\phi(Q_1) \le R_\phi(Q_2)$ for any loss function $\phi$.

Proof. By Theorem 1(a), we have $R_\phi(Q) = -I_f(\mu, \pi) = -I_{f_q}(P_1, P_{-1})$,
from which the corollary follows, using Proposition 1. □

Corollary 1 implies that if $R_\phi(Q_1) \le R_\phi(Q_2)$ for some loss function $\phi$,
then $R_{\mathrm{Bayes}}(Q_1) \le R_{\mathrm{Bayes}}(Q_2)$ for some set of prior probabilities on the
hypothesis space. This implication justifies the use of a given surrogate loss
function $\phi$ in place of the 0-1 loss for some prior probability; however, for
a given prior probability, it gives no guidance on how to choose $\phi$. Moreover,
the prior probabilities on the label $Y$ are typically unknown in many
applications. In such a setting, Blackwell's notion of $Q_1$ dominating $Q_2$ has
limited usefulness. With this motivation in mind, the following section is
devoted to development of a more stringent method for assessing equivalence
between loss functions.

4.2. Universal equivalence. Suppose that the loss functions $\phi_1$ and $\phi_2$
realize the $f$-divergences associated with the convex functions $f_1$ and $f_2$,
respectively. We then have the following definition.

Definition 3. The surrogate loss functions $\phi_1$ and $\phi_2$ are universally
equivalent, denoted by $\phi_1 \stackrel{u}{\approx} \phi_2$, if for any $\mathbb{P}(X, Y)$ and quantization rules
$Q_1, Q_2$, there holds:

$$R_{\phi_1}(Q_1) \le R_{\phi_1}(Q_2) \iff R_{\phi_2}(Q_1) \le R_{\phi_2}(Q_2).$$

In terms of the corresponding $f$-divergences, this relation is denoted by
$f_1 \stackrel{u}{\approx} f_2$.

Observe that this definition is very stringent, in that it requires that the
ordering between optimal $\phi_1$ and $\phi_2$ risks holds for all probability
distributions $\mathbb{P}$ on $\mathcal{X} \times \mathcal{Y}$. However, this stronger notion of equivalence is
needed for nonparametric approaches to classification, in which the underlying
distribution $\mathbb{P}$ is only weakly constrained.

The following result provides necessary and sufficient conditions for two
$f$-divergences to be universally equivalent.

Theorem 3. Let $f_1$ and $f_2$ be continuous, nonlinear and convex functions
on $[0, +\infty) \to \mathbb{R}$. Then, $f_1 \stackrel{u}{\approx} f_2$ if and only if
$f_1(u) = c f_2(u) + au + b$ for some constants $c > 0$ and $a, b$.
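The "if" direction can be illustrated numerically: when $f_1 = c f_2 + au + b$, one has $I_{f_1}(\mu, \pi) = c\, I_{f_2}(\mu, \pi) + ap + bq$, and the priors $p = \sum_z \mu(z)$ and $q = \sum_z \pi(z)$ do not depend on the quantizer. The sketch below uses randomly drawn pairs $(\mu, \pi)$ with common priors as stand-ins for two quantizers of the same joint distribution; the choice of $f_2$, the constants, and the sampling scheme are assumptions for illustration only.

```python
import math
import random


def f_div(f, mu, pi):
    """I_f(mu, pi) = sum_z pi(z) * f(mu(z) / pi(z))."""
    return sum(p * f(m / p) for m, p in zip(mu, pi))


def f2(u):
    """Example convex function, chosen only for illustration."""
    return -2.0 * math.sqrt(u)


c, a, b = 3.0, -1.5, 0.7       # arbitrary constants with c > 0


def f1(u):
    return c * f2(u) + a * u + b


def random_joint(p, m=4):
    """Random (mu, pi) on m symbols with sum(mu) = p and sum(pi) = 1 - p."""
    x = [random.random() + 0.01 for _ in range(m)]
    y = [random.random() + 0.01 for _ in range(m)]
    return [p * v / sum(x) for v in x], [(1 - p) * v / sum(y) for v in y]


random.seed(1)
p = 0.4
differences_match = True
for _ in range(1000):
    (mu1, pi1), (mu2, pi2) = random_joint(p), random_joint(p)
    d1 = f_div(f1, mu1, pi1) - f_div(f1, mu2, pi2)
    d2 = f_div(f2, mu1, pi1) - f_div(f2, mu2, pi2)
    # The affine part cancels, so the two differences differ by the factor c > 0.
    differences_match = differences_match and abs(d1 - c * d2) < 1e-9
```

Because $c > 0$, the two differences always share a sign, so both divergences order any two quantizers in the same way.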

An important special case is when one of the $f$-divergences is the
variational distance. In this case, we have the following.

Corollary 2. (a) All $f$-divergences based on continuous convex
$f : [0, +\infty) \to \mathbb{R}$ that are universally equivalent to the variational distance
have the form

$$f(u) = -c \min(u, 1) + au + b \quad \text{for some } c > 0. \tag{23}$$

(b) The 0-1 loss is universally equivalent only to those loss functions whose
corresponding $f$-divergence is based on a function of the form (23).

The above result establishes that only those surrogate loss functions
corresponding to the variational distance yield universal consistency in a strong
sense, meaning for any underlying $\mathbb{P}$ and a constrained class of quantization
rules.



5. Proofs. In this section, we provide detailed proofs of our main results, 
as well as some auxiliary results. 

5.1. Proofs of Theorem 1 and auxiliary properties. Our proof proceeds
via connecting some intermediate functions. First, let us define, for each $\beta$,
the inverse mapping

$$\phi^{-1}(\beta) := \inf\{\alpha : \phi(\alpha) \le \beta\}, \tag{24}$$

where $\inf \emptyset := +\infty$. The following result summarizes some useful properties
of $\phi^{-1}$.

Lemma 3. Suppose that $\phi$ is a convex loss satisfying assumptions A1,
A2 and A3.

(a) For all $\beta \in \mathbb{R}$ such that $\phi^{-1}(\beta) < +\infty$, the inequality $\phi(\phi^{-1}(\beta)) \le \beta$
holds. Furthermore, equality occurs when $\phi$ is continuous at $\phi^{-1}(\beta)$.

(b) The function $\phi^{-1} : \mathbb{R} \to \overline{\mathbb{R}}$ is strictly decreasing and convex.

Using the function $\phi^{-1}$, we define a new function $\Psi : \mathbb{R} \to \overline{\mathbb{R}}$ by

$$\Psi(\beta) := \begin{cases} \phi(-\phi^{-1}(\beta)), & \text{if } \phi^{-1}(\beta) \in \mathbb{R}, \\ +\infty, & \text{otherwise.} \end{cases} \tag{25}$$

Note that the domain of $\Psi$ is $\mathrm{Dom}(\Psi) = \{\beta \in \mathbb{R} : \phi^{-1}(\beta) \in \mathbb{R}\}$. Now, define

$$\beta_1 := \inf\{\beta : \Psi(\beta) < +\infty\} \quad \text{and} \quad \beta_2 := \inf\{\beta : \Psi(\beta) = \inf \Psi\}. \tag{26}$$

It is simple to check that $\inf \phi = \inf \Psi = \phi(\alpha^*)$, and $\beta_1 = \phi(\alpha^*)$, $\beta_2 = \phi(-\alpha^*)$.
Furthermore, by construction, we have $\Psi(\beta_2) = \phi(\alpha^*) = \beta_1$ as well as
$\Psi(\beta_1) = \phi(-\alpha^*) = \beta_2$. The following properties of $\Psi$ are particularly useful
for our main results.


Lemma 4. Suppose that $\phi$ is a convex loss satisfying assumptions A1,
A2 and A3. We have:

(a) $\Psi$ is strictly decreasing in the interval $(\beta_1, \beta_2)$. If $\phi$ is decreasing,
then $\Psi$ is also decreasing in $(-\infty, +\infty)$. In addition, $\Psi(\beta) = +\infty$ for $\beta < \beta_1$.

(b) $\Psi$ is convex in $(-\infty, \beta_2]$. If $\phi$ is a decreasing function, then $\Psi$ is
convex in $(-\infty, +\infty)$.

(c) $\Psi$ is lower semi-continuous, and continuous in its domain.

(d) For any $\alpha \ge 0$, $\phi(\alpha) = \Psi(\phi(-\alpha))$. In particular, there exists
$u^* \in (\beta_1, \beta_2)$ such that $\Psi(u^*) = u^*$.

(e) The function $\Psi$ satisfies $\Psi(\Psi(\beta)) \le \beta$ for all $\beta \in \mathrm{Dom}(\Psi)$. Moreover,
if $\phi$ is a continuous function on its domain $\{\alpha \in \mathbb{R} \mid \phi(\alpha) < +\infty\}$, then
$\Psi(\Psi(\beta)) = \beta$ for all $\beta \in (\beta_1, \beta_2)$.
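For a concrete feel for $\phi^{-1}$ and $\Psi$, the following sketch works them out in closed form for the hinge loss $\phi(\alpha) = \max(0, 1 - \alpha)$ and spot-checks properties (d) and (e) on a grid. It takes $\alpha^* = 1$ (the smallest minimizer of the hinge loss), which is an assumption about how $\alpha^*$ is defined earlier in the paper; the function names are hypothetical.

```python
import math


def hinge(a):
    """Hinge loss phi(a) = max(0, 1 - a)."""
    return max(0.0, 1.0 - a)


def hinge_inverse(beta):
    """phi^{-1}(beta) = inf{alpha : phi(alpha) <= beta}, with inf of the empty set = +inf."""
    return 1.0 - beta if beta >= 0.0 else math.inf


def psi(beta):
    """Psi(beta) = phi(-phi^{-1}(beta)) if phi^{-1}(beta) is finite, else +inf."""
    inv = hinge_inverse(beta)
    return hinge(-inv) if math.isfinite(inv) else math.inf


beta1, beta2 = hinge(1.0), hinge(-1.0)   # phi(alpha*) and phi(-alpha*), with alpha* = 1
u_star = hinge(0.0)                       # candidate fixed point phi(0)
grid = [beta1 + (beta2 - beta1) * k / 100 for k in range(1, 100)]
# Property (e): Psi(Psi(beta)) = beta on the open interval (beta1, beta2).
involution_ok = all(abs(psi(psi(b)) - b) < 1e-12 for b in grid)
# Property (d): phi(alpha) = Psi(phi(-alpha)) for alpha >= 0.
property_d_ok = all(abs(hinge(a) - psi(hinge(-a))) < 1e-12 for a in (0.0, 0.3, 1.0, 2.5))
```

Here $\Psi(\beta) = \max(0, 2 - \beta)$ for $\beta \ge 0$ and $+\infty$ otherwise, so $(\beta_1, \beta_2) = (0, 2)$ and the fixed point is $u^* = 1$.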



Let us proceed to part (a) of the theorem. The statement for general $\phi$
has already been proved in the derivation preceding the theorem statement. Now,
supposing that a decreasing convex surrogate loss $\phi$ satisfies assumptions
A1, A2 and A3, then

$$f(u) = -\inf_{\alpha \in \mathbb{R}} (\phi(-\alpha) + \phi(\alpha) u) = -\inf_{\{\alpha, \beta \,\mid\, \phi^{-1}(\beta) \in \mathbb{R},\ \phi(\alpha) = \beta\}} (\phi(-\alpha) + \beta u).$$

For $\beta$ such that $\phi^{-1}(\beta) \in \mathbb{R}$, there might be more than one $\alpha$ such that
$\phi(\alpha) = \beta$. However, our assumption (4) ensures that $\alpha = \phi^{-1}(\beta)$ results in
minimum $\phi(-\alpha)$. Hence,

$$f(u) = -\inf_{\beta : \phi^{-1}(\beta) \in \mathbb{R}} (\phi(-\phi^{-1}(\beta)) + \beta u) = -\inf_{\beta \in \mathbb{R}} (\beta u + \Psi(\beta)) = \sup_{\beta \in \mathbb{R}} (-\beta u - \Psi(\beta)) = \Psi^*(-u).$$

By Lemma 4(b), the fact that $\phi$ is decreasing implies that $\Psi$ is convex. By
convex duality and the lower semicontinuity of $\Psi$ (from Lemma 4(c)), we
can also write

$$\Psi(\beta) = \Psi^{**}(\beta) = f^*(-\beta). \tag{27}$$

Thus, $\Psi$ is identical to the function $\Psi$ defined in equation (8). The proof of
part (a) is complete, thanks to Lemma 4. Furthermore, it can be shown that
$\phi$ must have the form (9). Indeed, from Lemma 4(d), we have
$\Psi(\phi(0)) = \phi(0) \in (\beta_1, \beta_2)$. As a consequence, $u^* := \phi(0)$ satisfies the relation
$\Psi(u^*) = u^*$. Since $\phi$ is decreasing and convex on the interval $(-\infty, 0]$, for any
$\alpha \ge 0$, we can write

$$\phi(-\alpha) = g(\alpha + u^*),$$

where $g$ is some increasing continuous and convex function. From Lemma 4(d),
we have $\phi(\alpha) = \Psi(\phi(-\alpha)) = \Psi(g(\alpha + u^*))$ for $\alpha \ge 0$. To ensure the
continuity at 0, there holds $u^* = \phi(0) = g(u^*)$. To ensure that $\phi$ is
classification-calibrated, we require that $\phi$ be differentiable at 0 and $\phi'(0) < 0$. These
conditions in turn imply that $g$ must be right-differentiable at $u^*$, with
$g'(u^*) > 0$.

Let us turn to part (b) of the theorem. Since $f$ is lower semicontinuous
by assumption, convex duality allows us to write

$$f(u) = f^{**}(u) = \Psi^*(-u) = \sup_{\beta \in \mathbb{R}} (-\beta u - \Psi(\beta)) = -\inf_{\beta \in \mathbb{R}} (\beta u + \Psi(\beta)).$$

Note that $\Psi$ is lower semicontinuous and convex by definition. To prove that
any surrogate loss $\phi$ of form (9) (along with conditions A1-A3) must induce
$f$-divergences in the sense of equation (6) [and thus equation (7)], it remains
to show that $\phi$ is linked to $\Psi$ via the relation

$$\Psi(\beta) = \begin{cases} \phi(-\phi^{-1}(\beta)), & \text{if } \phi^{-1}(\beta) \in \mathbb{R}, \\ +\infty, & \text{otherwise.} \end{cases} \tag{28}$$

Since $\Psi$ is assumed to be a decreasing function, the function $\phi$ defined in (9)
is also a decreasing function. Using the fixed point $u^* \in (\beta_1, \beta_2)$ of function
$\Psi$, we divide our analysis into three cases:

• For $\beta \ge u^*$, there exists $\alpha \ge 0$ such that $g(\alpha + u^*) = \beta$. Choose the largest
such $\alpha$. From our definition of $\phi$, $\phi(-\alpha) = \beta$. Thus, $\phi^{-1}(\beta) = -\alpha$. It
follows that $\phi(-\phi^{-1}(\beta)) = \phi(\alpha) = \Psi(g(\alpha + u^*)) = \Psi(\beta)$.

• For $\beta < \beta_1$, then $\Psi(\beta) = +\infty$. It can also be verified that $\phi^{-1}(\beta) = +\infty$,
so that the right-hand side of (28) equals $+\infty$ as well.

• Lastly, for $\beta_1 \le \beta < u^* < \beta_2$, there exists $\alpha > 0$ such that
$g(\alpha + u^*) \in (u^*, \beta_2)$ and $\beta = \Psi(g(\alpha + u^*))$, which implies that $\beta = \phi(\alpha)$ from our
definition. Choose the smallest $\alpha$ that satisfies these conditions. Then,
$\phi^{-1}(\beta) = \alpha$, and it follows that
$\phi(-\phi^{-1}(\beta)) = \phi(-\alpha) = g(\alpha + u^*) = \Psi(\Psi(g(\alpha + u^*))) = \Psi(\beta)$,
where we have used the fact that $g(\alpha + u^*) \in (\beta_1, \beta_2)$.

The proof of Theorem 1 is complete.



5.1.1. Some additional properties. In the remainder of this section we
present several useful properties of surrogate losses and $f$-divergences.
Although Theorem 1 provides one set of conditions for an $f$-divergence to be
realized by some surrogate loss $\phi$, as well as a constructive procedure for
finding all such loss functions, the following result provides a related set of
conditions that can be easier to verify. We say that an $f$-divergence is
symmetric if $I_f(\mu, \pi) = I_f(\pi, \mu)$ for any measures $\mu$ and $\pi$. With this definition,
we have the following.

Corollary 3. Suppose that $f : [0, +\infty) \to \mathbb{R}$ is a continuous and convex
function. The following are equivalent:

(a) The function $f$ is realizable by some surrogate loss function $\phi$ (via
Theorem 1).

(b) The $f$-divergence $I_f$ is symmetric.

(c) For any $u > 0$, $f(u) = u f(1/u)$.

Proof. (a) $\Rightarrow$ (b): From Theorem 1(a), we have the representation
$R_\phi(Q) = -I_f(\mu, \pi)$. Alternatively, we can write

$$R_\phi(Q) = \sum_z \mu(z) \min_\alpha \left( \phi(\alpha) + \phi(-\alpha) \frac{\pi(z)}{\mu(z)} \right) = -\sum_z \mu(z) f\!\left(\frac{\pi(z)}{\mu(z)}\right),$$

which is equal to $-I_f(\pi, \mu)$, thereby showing that the $f$-divergence is
symmetric.

(b) $\Rightarrow$ (c): By assumption, the following relation holds for any measures
$\mu$ and $\pi$:

$$\sum_z \pi(z) f(\mu(z)/\pi(z)) = \sum_z \mu(z) f(\pi(z)/\mu(z)). \tag{29}$$

Take any instance of $z = l \in \mathcal{Z}$, and consider measures $\mu'$ and $\pi'$, which are
defined on the space $\mathcal{Z} - \{l\}$ such that $\mu'(z) = \mu(z)$ and $\pi'(z) = \pi(z)$ for all
$z \in \mathcal{Z} - \{l\}$. Since condition (29) also holds for $\mu'$ and $\pi'$, it follows that

$$\pi(z) f(\mu(z)/\pi(z)) = \mu(z) f(\pi(z)/\mu(z))$$

for all $z \in \mathcal{Z}$ and any $\mu$ and $\pi$. Hence, $f(u) = u f(1/u)$ for any $u > 0$.

(c) $\Rightarrow$ (a): It suffices to show that all sufficient conditions specified by
Theorem 1 are satisfied.

Since any $f$-divergence is defined by applying $f$ to a likelihood ratio [see
definition (2)], we can assume $f(u) = +\infty$ for $u < 0$ without loss of generality.
Since $f(u) = u f(1/u)$ for any $u > 0$, it can be verified using subdifferential
calculus [8] that for any $u > 0$, there holds

$$\partial f(u) = f(1/u) + \partial f(1/u)\, \frac{-1}{u}. \tag{30}$$

Given some $u > 0$, consider any $v_1 \in \partial f(u)$. Combined with equation (30)
and the equality $f(u) = u f(1/u)$, we have

$$f(u) - v_1 u \in \partial f(1/u). \tag{31}$$

By definition of conjugate duality, $f^*(v_1) = v_1 u - f(u)$.
Letting $\Psi(\beta) = f^*(-\beta)$ as in Theorem 1, we have

$$\Psi(\Psi(-v_1)) = \Psi(f^*(v_1)) = \Psi(v_1 u - f(u)) = f^*(f(u) - v_1 u) = \sup_{\beta \in \mathbb{R}} (\beta f(u) - \beta v_1 u - f(\beta)).$$

Note that from equation (31), the supremum is achieved at $\beta = 1/u$, so
that we have $\Psi(\Psi(-v_1)) = -v_1$ for any $v_1 \in \partial f(u)$ for $u > 0$. In other words,
$\Psi(\Psi(\beta)) = \beta$ for any $\beta \in \{-\partial f(u), u > 0\}$. Convex duality and the definition
$\Psi(\beta) = f^*(-\beta)$ imply that $\beta \in -\partial f(u)$ for some $u > 0$ if and only if
$-u \in \partial \Psi(\beta)$ for some $u > 0$. This condition on $\beta$ is equivalent to the subdifferential
$\partial \Psi(\beta)$ containing some negative value, which is satisfied by any $\beta \in (\beta_1, \beta_2)$,
so that $\Psi(\Psi(\beta)) = \beta$ for $\beta \in (\beta_1, \beta_2)$. In addition, since $f(u) = +\infty$ for $u < 0$,
$\Psi$ is a decreasing function. Now, as an application of Theorem 1, we conclude
that $I_f$ is realizable by some (decreasing) surrogate loss function. □
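Condition (c) of Corollary 3 is easy to test pointwise. The sketch below checks $f(u) = u f(1/u)$ at a few values of $u$ for three example functions; the specific functions, the test points, and the helper name `satisfies_symmetry` are assumptions made for illustration, and a pointwise check on finitely many values is of course not a proof.

```python
import math


def satisfies_symmetry(f, points, tol=1e-12):
    """Check f(u) == u * f(1/u) (up to rounding) at each test point u > 0."""
    return all(abs(f(u) - u * f(1.0 / u)) <= tol * max(1.0, abs(f(u))) for u in points)


def variational(u):
    """f(u) = -2 * min(u, 1), a function of the variational-distance type."""
    return -2.0 * min(u, 1.0)


def sqrt_type(u):
    """f(u) = -2 * sqrt(u), a Hellinger-type example."""
    return -2.0 * math.sqrt(u)


def kl_type(u):
    """f(u) = -log(u), which generates a (non-symmetric) KL divergence."""
    return -math.log(u)


points = [0.1, 0.5, 1.0, 2.0, 7.5]
results = {
    "variational": satisfies_symmetry(variational, points),
    "sqrt_type": satisfies_symmetry(sqrt_type, points),
    "kl_type": satisfies_symmetry(kl_type, points),
}
```

The first two functions pass the check while the KL-type function fails it, in line with the equivalence between condition (c) and realizability by a surrogate loss.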

The following result establishes a link between (un)boundedness and the
properties of the associated $f$.

Corollary 4. Assume that $\phi$ is a decreasing (continuous convex) loss
function corresponding to an $f$-divergence, where $f$ is a continuous convex
function that is bounded from below by an affine function. Then, $\phi$ is
unbounded from below if and only if $f$ is 1-coercive, that is,
$f(x)/\|x\| \to +\infty$ as $\|x\| \to \infty$.

Proof. $\phi$ is unbounded from below if and only if $\Psi(\beta) = \phi(-\phi^{-1}(\beta)) \in \mathbb{R}$
for all $\beta \in \mathbb{R}$, which is equivalent to the dual function $f(\beta) = \Psi^*(-\beta)$
being 1-coercive; cf. [8]. □

Consequently, for any decreasing and lower-bounded loss $\phi$ (which includes
the hinge, logistic and exponential losses), the associated $f$-divergence
is not 1-coercive. Other interesting $f$-divergences such as the symmetric KL
divergence considered in [5] are 1-coercive, meaning that any associated
surrogate loss $\phi$ cannot be bounded below.

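5.2. Proof of Theorem 2. First let us prove Lemma 2:

Proof. Since $\phi$ has form (9), it is easy to check that $\phi(0) = (c - a - b)/2$.
Now, note that

$$\begin{aligned}
R_{\mathrm{Bayes}}(\gamma, Q) - R^*_{\mathrm{Bayes}}
&= R_{\mathrm{Bayes}}(\gamma, Q) - R_{\mathrm{Bayes}}(Q) + R_{\mathrm{Bayes}}(Q) - R^*_{\mathrm{Bayes}} \\
&= \sum_{z \in \mathcal{Z}} \big[ \pi(z) I(\gamma(z) > 0) + \mu(z) I(\gamma(z) < 0) - \min\{\mu(z), \pi(z)\} \big] + R_{\mathrm{Bayes}}(Q) - R^*_{\mathrm{Bayes}} \\
&= \sum_{z : (\mu(z) - \pi(z))\gamma(z) < 0} |\mu(z) - \pi(z)| + R_{\mathrm{Bayes}}(Q) - R^*_{\mathrm{Bayes}}.
\end{aligned}$$

In addition,

$$R_\phi(\gamma, Q) - R_\phi^* = R_\phi(\gamma, Q) - R_\phi(Q) + R_\phi(Q) - R_\phi^*.$$

By Theorem 1(a),

$$\begin{aligned}
R_\phi(Q) - R_\phi^*
&= -I_f(\mu, \pi) - \inf_{Q \in \mathcal{Q}} (-I_f(\mu, \pi)) \\
&= c \sum_{z \in \mathcal{Z}} \min\{\mu(z), \pi(z)\} - \inf_{Q \in \mathcal{Q}} c \sum_{z \in \mathcal{Z}} \min\{\mu(z), \pi(z)\} \\
&= c\,(R_{\mathrm{Bayes}}(Q) - R^*_{\mathrm{Bayes}}).
\end{aligned}$$

Therefore, the lemma will be immediate once we can show that

$$\begin{aligned}
\frac{c}{2} \sum_{z : (\mu(z) - \pi(z))\gamma(z) < 0} |\mu(z) - \pi(z)|
&\le R_\phi(\gamma, Q) - R_\phi(Q) \\
&= \sum_{z \in \mathcal{Z}} \big[ \pi(z)\phi(-\gamma(z)) + \mu(z)\phi(\gamma(z)) - c \min\{\mu(z), \pi(z)\} \big] + ap + bq.
\end{aligned} \tag{32}$$

It is easy to check that for any $z \in \mathcal{Z}$ such that $(\mu(z) - \pi(z))\gamma(z) < 0$, there
holds

$$\pi(z)\phi(-\gamma(z)) + \mu(z)\phi(\gamma(z)) \ge \pi(z)\phi(0) + \mu(z)\phi(0). \tag{33}$$

Indeed, without loss of generality, suppose $\mu(z) > \pi(z)$. Since $\phi$ is
classification-calibrated, the convex function (with respect to $\alpha$)
$\pi(z)\phi(-\alpha) + \mu(z)\phi(\alpha)$ achieves its minimum at some $\alpha \ge 0$. Hence, for any
$\alpha \le 0$, $\pi(z)\phi(-\alpha) + \mu(z)\phi(\alpha) \ge \pi(z)\phi(0) + \mu(z)\phi(0)$. Hence, the
statement (33) is proven. The RHS of equation (32) is lower bounded by

$$\begin{aligned}
&\sum_{z : (\mu(z) - \pi(z))\gamma(z) < 0} \big[ (\pi(z) + \mu(z))\phi(0) - c \min\{\mu(z), \pi(z)\} \big] + ap + bq \\
&\quad = \sum_{z : (\mu(z) - \pi(z))\gamma(z) < 0} \Big[ (\pi(z) + \mu(z)) \frac{c - a - b}{2} - c \min\{\mu(z), \pi(z)\} \Big] + ap + bq \\
&\quad \ge \frac{c}{2} \sum_{z : (\mu(z) - \pi(z))\gamma(z) < 0} |\mu(z) - \pi(z)| - (a + b)(p + q)/2 + ap + bq \\
&\quad = \frac{c}{2} \sum_{z : (\mu(z) - \pi(z))\gamma(z) < 0} |\mu(z) - \pi(z)| + \frac{1}{2}(a - b)(p - q) \\
&\quad \ge \frac{c}{2} \sum_{z : (\mu(z) - \pi(z))\gamma(z) < 0} |\mu(z) - \pi(z)|.
\end{aligned}$$

This completes the proof of the lemma. □

We are now equipped to prove Theorem 2. For part (a), first observe that
the value of $\sup_{\gamma \in \mathcal{C}_n, Q \in \mathcal{D}_n} |\hat{R}_\phi(\gamma, Q) - R_\phi(\gamma, Q)|$ varies by at most $2M_n/n$
if one changes the values of $(X_i, Y_i)$ for some index $i \in \{1, \ldots, n\}$. Hence,
applying McDiarmid's inequality yields concentration around the expected
value [14], or (alternatively stated) we have that, with probability at least
$1 - \delta$,

$$\Big| \sup_{\gamma \in \mathcal{C}_n, Q \in \mathcal{D}_n} |\hat{R}_\phi(\gamma, Q) - R_\phi(\gamma, Q)| - \mathcal{E}_1(\mathcal{C}_n, \mathcal{D}_n) \Big| \le M_n \sqrt{2 \ln(1/\delta)/n}. \tag{34}$$

Suppose that $R_\phi(\gamma, Q)$ attains its minimum over the compact subset
$(\mathcal{C}_n, \mathcal{D}_n)$ at $(\gamma_n^\dagger, Q_n^\dagger)$. Then, using Lemma 2, we have

$$\begin{aligned}
\frac{c}{2}\,(R_{\mathrm{Bayes}}(\gamma_n^*, Q_n^*) - R^*_{\mathrm{Bayes}})
&\le R_\phi(\gamma_n^*, Q_n^*) - R_\phi^* \\
&= R_\phi(\gamma_n^*, Q_n^*) - R_\phi(\gamma_n^\dagger, Q_n^\dagger) + R_\phi(\gamma_n^\dagger, Q_n^\dagger) - R_\phi^* \\
&= R_\phi(\gamma_n^*, Q_n^*) - R_\phi(\gamma_n^\dagger, Q_n^\dagger) + \mathcal{E}_0(\mathcal{C}_n, \mathcal{D}_n).
\end{aligned}$$

Hence, using the inequality (34), we have, with probability at least $1 - \delta$,

$$\begin{aligned}
\frac{c}{2}\,(R_{\mathrm{Bayes}}(\gamma_n^*, Q_n^*) - R^*_{\mathrm{Bayes}})
&\le \hat{R}_\phi(\gamma_n^*, Q_n^*) - \hat{R}_\phi(\gamma_n^\dagger, Q_n^\dagger) + 2\mathcal{E}_1(\mathcal{C}_n, \mathcal{D}_n) + 2M_n \sqrt{2\ln(2/\delta)/n} + \mathcal{E}_0(\mathcal{C}_n, \mathcal{D}_n) \\
&\le 2\mathcal{E}_1(\mathcal{C}_n, \mathcal{D}_n) + \mathcal{E}_0(\mathcal{C}_n, \mathcal{D}_n) + 2M_n \sqrt{2\ln(2/\delta)/n},
\end{aligned}$$

from which Theorem 2(a) follows.

For part (b), this statement follows by applying (a) with $\delta = 1/n$.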

5.3. Proof of Theorem 3. One direction of the theorem ("if") is easy.
We focus on the other direction. The proof relies on the following technical
result.

Lemma 5. Given a continuous convex function $f : \mathbb{R}^+ \to \mathbb{R}$, for any
$u, v \in \mathbb{R}^+$, define

$$T_f(u, v) := \left\{ \frac{u\alpha - v\beta - f(u) + f(v)}{\alpha - \beta} = \frac{f^*(\alpha) - f^*(\beta)}{\alpha - \beta} \;\Big|\; \alpha \in \partial f(u),\ \beta \in \partial f(v),\ \alpha \ne \beta \right\}.$$

If $f_1 \stackrel{u}{\approx} f_2$, then for any $u, v > 0$, one of the following must be true:

(1) $T_f(u, v)$ are nonempty for both $f_1$ and $f_2$, and $T_{f_1}(u, v) = T_{f_2}(u, v)$.

(2) Both $f_1$ and $f_2$ are linear in the interval $(u, v)$.

Now, let us proceed to prove Theorem 3. The convex function $f : [0, \infty) \to \mathbb{R}$
is continuous on $(0, \infty)$ and hence is almost everywhere differentiable on
$(0, \infty)$ (see [16]). Note that if function $f$ is differentiable at $u$ and $v$ and
$f'(u) \ne f'(v)$, then $T_f(u, v)$ is reduced to a number

$$\frac{u f'(u) - v f'(v) - f(u) + f(v)}{f'(u) - f'(v)} = \frac{f^*(\alpha) - f^*(\beta)}{\alpha - \beta},$$

where $\alpha = f'(u)$, $\beta = f'(v)$, and $f^*$ denotes the conjugate dual of $f$.

Let $v$ be an arbitrary point where both $f_1$ and $f_2$ are differentiable.
Let $d_1 = f_1'(v)$, $d_2 = f_2'(v)$. Without loss of generality, we may assume that
$f_1(v) = f_2(v) = 0$; if not, we simply consider the functions $f_1(u) - f_1(v)$ and
$f_2(u) - f_2(v)$.

Now, for any $u$ where both $f_1$ and $f_2$ are differentiable, applying Lemma 5
for $v$ and $u$, then either $f_1$ and $f_2$ are both linear in $[v, u]$ (or $[u, v]$ if $u < v$),
in which case $f_1(u) = c f_2(u)$ for some constant $c$, or the following is true:

$$\frac{u f_1'(u) - f_1(u) - v d_1}{f_1'(u) - d_1} = \frac{u f_2'(u) - f_2(u) - v d_2}{f_2'(u) - d_2}.$$

In either case, we have

$$(u f_1'(u) - f_1(u) - v d_1)(f_2'(u) - d_2) = (u f_2'(u) - f_2(u) - v d_2)(f_1'(u) - d_1).$$

Let $g_1, g_2$ be defined by $f_1(u) = g_1(u) + d_1 u$, $f_2(u) = g_2(u) + d_2 u$. Then,
$(u g_1'(u) - g_1(u) - v d_1) g_2'(u) = (u g_2'(u) - g_2(u) - v d_2) g_1'(u)$, implying that
$(g_1(u) + v d_1) g_2'(u) = (g_2(u) + v d_2) g_1'(u)$ for any $u$ where $f_1$ and $f_2$ are both
differentiable. Since $u$ and $v$ can be chosen almost everywhere, $v$ is chosen so
that there does not exist any open interval for $u$ such that $g_2(u) + v d_2 = 0$.
It follows that $g_1(u) + v d_1 = c(g_2(u) + v d_2)$ for some constant $c$ and this
constant $c$ has to be the same for any $u$ due to the continuity of $f_1$ and $f_2$.
Hence, we have
$f_1(u) = g_1(u) + d_1 u = c g_2(u) + d_1 u + c v d_2 - v d_1 = c f_2(u) + (d_1 - c d_2) u + c v d_2 - v d_1$.
It is now simple to check that $c > 0$ is necessary
and sufficient for $I_{f_1}$ and $I_{f_2}$ to have the same monotonicity.

A. Proof of Lemma 3. (a) Since $\phi^{-1}(\beta) < +\infty$, we have
$\phi(\phi^{-1}(\beta)) = \phi(\inf\{\alpha : \phi(\alpha) \le \beta\}) \le \beta$, where the final inequality follows from the lower
semi-continuity of $\phi$. If $\phi$ is continuous at $\phi^{-1}(\beta)$, then we have
$\phi^{-1}(\beta) = \min\{\alpha : \phi(\alpha) = \beta\}$, in which case we have $\phi(\phi^{-1}(\beta)) = \beta$.

(b) Due to convexity and the inequality $\phi'(0) < 0$, it follows that $\phi$ is a
strictly decreasing function in $(-\infty, \alpha^*]$. Furthermore, for all $\beta \in \mathbb{R}$ such that
$\phi^{-1}(\beta) < +\infty$, we must have $\phi^{-1}(\beta) \le \alpha^*$. Therefore, definition (24) and the
(decreasing) monotonicity of $\phi$ imply that for any $a, b \in \mathbb{R}$, if $b > a > \inf \phi$,
then $\phi^{-1}(a) > \phi^{-1}(b)$, which establishes that $\phi^{-1}$ is a decreasing function.
In addition, we have $a \ge \phi^{-1}(b)$ if and only if $\phi(a) \le b$.

Now, due to the convexity of $\phi$, applying Jensen's inequality for any
$0 < \lambda < 1$, we have

$$\phi(\lambda \phi^{-1}(\beta_1) + (1 - \lambda)\phi^{-1}(\beta_2)) \le \lambda \phi(\phi^{-1}(\beta_1)) + (1 - \lambda)\phi(\phi^{-1}(\beta_2)) \le \lambda \beta_1 + (1 - \lambda)\beta_2.$$

Therefore,

$$\lambda \phi^{-1}(\beta_1) + (1 - \lambda)\phi^{-1}(\beta_2) \ge \phi^{-1}(\lambda \beta_1 + (1 - \lambda)\beta_2),$$

implying the convexity of $\phi^{-1}$.

B. Proof of Lemma 4.

Proof. (a) We first prove the statement for the case of a decreasing
function $\phi$. First, if $a \ge b$ and $\phi^{-1}(a) \notin \mathbb{R}$, then $\phi^{-1}(b) \notin \mathbb{R}$; hence,
$\Psi(a) = \Psi(b) = +\infty$. If only $\phi^{-1}(b) \notin \mathbb{R}$, then clearly $\Psi(b) \ge \Psi(a)$ [since
$\Psi(b) = +\infty$]. If $a \ge b$, and both $\phi^{-1}(a), \phi^{-1}(b) \in \mathbb{R}$, then, from the
previous lemma, $\phi^{-1}(a) \le \phi^{-1}(b)$, so that $\phi(-\phi^{-1}(a)) \le \phi(-\phi^{-1}(b))$, implying
that $\Psi$ is a decreasing function.

We next consider the case of a general function $\phi$. For $\beta \in (\beta_1, \beta_2)$, we
have $\phi^{-1}(\beta) \in (-\alpha^*, \alpha^*)$, and hence $-\phi^{-1}(\beta) \in (-\alpha^*, \alpha^*)$. Since $\phi$ is strictly
decreasing in $(-\infty, \alpha^*]$, then $\phi(-\phi^{-1}(\beta))$ is strictly decreasing in $(\beta_1, \beta_2)$.
Finally, when $\beta < \inf \Psi = \phi(\alpha^*)$, $\phi^{-1}(\beta) \notin \mathbb{R}$, so $\Psi(\beta) = +\infty$ by definition.

(b) First of all, assume that $\phi$ is decreasing. By applying Jensen's
inequality, for any $0 < \lambda < 1$, we have

$$\begin{aligned}
\lambda \Psi(\gamma_1) + (1 - \lambda)\Psi(\gamma_2)
&= \lambda \phi(-\phi^{-1}(\gamma_1)) + (1 - \lambda)\phi(-\phi^{-1}(\gamma_2)) \\
&\ge \phi(-\lambda \phi^{-1}(\gamma_1) - (1 - \lambda)\phi^{-1}(\gamma_2)) \quad \text{since } \phi \text{ is convex} \\
&\ge \phi(-\phi^{-1}(\lambda \gamma_1 + (1 - \lambda)\gamma_2)) \\
&= \Psi(\lambda \gamma_1 + (1 - \lambda)\gamma_2),
\end{aligned}$$

where the last inequality is due to the convexity of $\phi^{-1}$ and decreasing $\phi$.
Hence, $\Psi$ is a convex function.

In general, the above arguments go through for any $\gamma_1, \gamma_2 \in [\beta_1, \beta_2]$. Since
$\Psi(\beta) = +\infty$ for $\beta < \beta_1$, this implies that $\Psi$ is convex in $(-\infty, \beta_2]$.

(c) For any $a \in \mathbb{R}$, from the definition of $\phi^{-1}$ and due to the continuity
of $\phi$,

$$\begin{aligned}
\{\beta \mid \Psi(\beta) = \phi(-\phi^{-1}(\beta)) \le a\}
&= \{\beta \mid -\phi^{-1}(\beta) \ge \phi^{-1}(a)\} \\
&= \{\beta \mid \phi^{-1}(\beta) \le -\phi^{-1}(a)\} \\
&= \{\beta \mid \beta \ge \phi(-\phi^{-1}(a))\}
\end{aligned}$$

is a closed set. Similarly, $\{\beta \in \mathbb{R} \mid \Psi(\beta) \ge a\}$ is a closed set. Hence, $\Psi$ is
continuous in its domain.

(d) Since $\phi$ is assumed to be classification-calibrated, Lemma 1 implies
that $\phi$ is differentiable at 0 and $\phi'(0) < 0$. Since $\phi$ is convex, this implies that
$\phi$ is strictly decreasing for $\alpha \le 0$. As a result, for any $\alpha \ge 0$, let $\beta = \phi(-\alpha)$,
then we obtain $\alpha = -\phi^{-1}(\beta)$. Since $\Psi(\beta) = \phi(-\phi^{-1}(\beta))$, we have $\Psi(\beta) = \phi(\alpha)$.
Hence, $\Psi(\phi(-\alpha)) = \phi(\alpha)$. Letting $u^* = \phi(0)$, then we have $\Psi(u^*) = u^*$
and $u^* \in (\beta_1, \beta_2)$.

(e) Let $\alpha = \Psi(\beta) = \phi(-\phi^{-1}(\beta))$. Then, from equation (24),
$\phi^{-1}(\alpha) \le -\phi^{-1}(\beta)$. Therefore,

$$\Psi(\Psi(\beta)) = \Psi(\alpha) = \phi(-\phi^{-1}(\alpha)) \le \phi(\phi^{-1}(\beta)) \le \beta.$$

We have proved that $\Psi$ is strictly decreasing for $\beta \in (\beta_1, \beta_2)$. As such,
$\phi^{-1}(\alpha) = -\phi^{-1}(\beta)$. We also have $\phi(\phi^{-1}(\beta)) = \beta$. It follows that $\Psi(\Psi(\beta)) = \beta$
for all $\beta \in (\beta_1, \beta_2)$.

Remark. With reference to statement (b), if $\phi$ is not a decreasing function,
then the function $\Psi$ need not be convex on the entire real line. For
instance, the following loss function generates a function $\Psi$ that is not convex:
$\phi(\alpha) = (1 - \alpha)^2$ when $\alpha \le 1$, $0$ when $\alpha \in [1, 2]$, and $\alpha - 2$ otherwise. Then,
we have $\Psi(9) = \phi(2) = 0$, $\Psi(16) = \phi(3) = 1$, and
$\Psi(25/2) = \phi(-1 + 5/\sqrt{2}) = -3 + 5/\sqrt{2} > (\Psi(9) + \Psi(16))/2$. □



C. Proof of Lemma 5.

Proof. Consider a joint distribution $\mathbb{P}(X, Y)$ defined by
$\mathbb{P}(Y = -1) = q = 1 - \mathbb{P}(Y = 1)$ and

$$\mathbb{P}(X \mid Y = -1) \sim \mathrm{Uniform}[0, b] \quad \text{and} \quad \mathbb{P}(X \mid Y = 1) \sim \mathrm{Uniform}[a, c],$$

where $0 < a < b < c$. Let $\mathcal{Z} = \{1, 2\}$. We assume $Z$ is produced by a
deterministic quantizer design $Q$ specified by a threshold $t \in (a, b)$; in particular,
we set $Q(z = 1 \mid x) = 1$ when $x \le t$, and $Q(z = 2 \mid x) = 1$ when $x > t$. Under
this quantizer design, we have

$$\mu(1) = (1 - q)\frac{t - a}{c - a}; \qquad \mu(2) = (1 - q)\frac{c - t}{c - a};$$

$$\pi(1) = q\,\frac{t}{b}; \qquad \pi(2) = q\,\frac{b - t}{b}.$$

Therefore, the $f$-divergence between $\mu$ and $\pi$ takes the form

$$I_f(\mu, \pi) = \frac{qt}{b}\, f\!\left(\frac{(t - a)\,b\,(1 - q)}{(c - a)\,t\,q}\right) + \frac{q(b - t)}{b}\, f\!\left(\frac{(c - t)\,b\,(1 - q)}{(c - a)(b - t)\,q}\right).$$

If $f_1 \stackrel{u}{\approx} f_2$, then $I_{f_1}(\mu, \pi)$ and $I_{f_2}(\mu, \pi)$ have the same monotonicity property
for any $q \in (0, 1)$, as well as for any choice of the parameters $a < b < c$.
Let $\gamma = \frac{b(1 - q)}{(c - a)q}$, which can be chosen arbitrarily positive, and then define the
function

$$F(f, t) = t f\!\left(\frac{(t - a)\gamma}{t}\right) + (b - t) f\!\left(\frac{(c - t)\gamma}{b - t}\right).$$

Note that the functions $F(f_1, t)$ and $F(f_2, t)$ have the same monotonicity
property, for any positive parameters $\gamma$ and $a < b < c$.
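The construction above can be reproduced numerically. The sketch below builds $\mu$ and $\pi$ for the threshold quantizer, checks that $I_f(\mu, \pi) = (q/b)\,F(f, t)$ on a grid of thresholds, and computes second differences of $F(f, \cdot)$ as a rough convexity check. The example function $(u - 1)\log u$, the parameter values, and all function names are assumptions chosen only for illustration.

```python
import math


def f_div(f, mu, pi):
    """I_f(mu, pi) = sum_z pi(z) * f(mu(z) / pi(z))."""
    return sum(p * f(m / p) for m, p in zip(mu, pi))


def quantized_measures(a, b, c, q, t):
    """Return (mu, pi) on Z = {1, 2} for the threshold quantizer at t in (a, b)."""
    mu = [(1 - q) * (t - a) / (c - a), (1 - q) * (c - t) / (c - a)]
    pi = [q * t / b, q * (b - t) / b]
    return mu, pi


def F(f, t, a, b, c, gamma):
    """F(f, t) = t * f((t - a) * gamma / t) + (b - t) * f((c - t) * gamma / (b - t))."""
    return t * f((t - a) * gamma / t) + (b - t) * f((c - t) * gamma / (b - t))


def sym_kl(u):
    """Example convex f, (u - 1) * log(u); any convex f could be used instead."""
    return (u - 1.0) * math.log(u)


a, b, c, q = 1.0, 2.0, 3.5, 0.4          # arbitrary values with 0 < a < b < c
gamma = b * (1 - q) / ((c - a) * q)
ts = [a + (b - a) * k / 50 for k in range(1, 50)]
F_vals = [F(sym_kl, t, a, b, c, gamma) for t in ts]

# I_f(mu, pi) should equal (q / b) * F(f, t) at every threshold t.
max_err = max(
    abs(f_div(sym_kl, *quantized_measures(a, b, c, q, t)) - (q / b) * Fv)
    for t, Fv in zip(ts, F_vals)
)
# Second differences on the uniform grid; nonnegative values are consistent with convexity in t.
min_second_diff = min(
    F_vals[k - 1] - 2 * F_vals[k] + F_vals[k + 1] for k in range(1, len(ts) - 1)
)
```

Since $q/b > 0$ does not depend on $t$, the monotonicity of $I_f(\mu, \pi)$ in $t$ is the same as that of $F(f, t)$, which is why the proof can work with $F$ directly.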

We now claim that $F(f, t)$ is a convex function of $t$. Indeed, using convex
duality [18], $F(f, t)$ can be expressed as follows:

$$\begin{aligned}
F(f, t) &= t \sup_{r \in \mathbb{R}} \left\{ \frac{(t - a)\gamma}{t}\, r - f^*(r) \right\} + (b - t) \sup_{s \in \mathbb{R}} \left\{ \frac{(c - t)\gamma}{b - t}\, s - f^*(s) \right\} \\
&= \sup_{r, s} \{ (t - a) r \gamma - t f^*(r) + (c - t) s \gamma - (b - t) f^*(s) \},
\end{aligned}$$

which is a supremum over a linear function of $t$, thereby showing that $F(f, t)$
is convex in $t$.

It follows that both $F(f_1, t)$ and $F(f_2, t)$ are subdifferentiable everywhere
in their domains; since they have the same monotonicity property, we must
have

$$0 \in \partial F(f_1, t) \iff 0 \in \partial F(f_2, t). \tag{35}$$

It can be verified using subdifferential calculus [8] that

$$\partial F(f, t) = \frac{a\gamma}{t}\, \partial f\!\left(\frac{(t - a)\gamma}{t}\right) + f\!\left(\frac{(t - a)\gamma}{t}\right) - f\!\left(\frac{(c - t)\gamma}{b - t}\right) + \frac{(c - b)\gamma}{b - t}\, \partial f\!\left(\frac{(c - t)\gamma}{b - t}\right).$$

Letting $u = \frac{(t - a)\gamma}{t}$, $v = \frac{(c - t)\gamma}{b - t}$, we have

$$0 \in \partial F(f, t) \tag{36a}$$

$$\iff 0 \in (\gamma - u)\,\partial f(u) + f(u) - f(v) + (v - \gamma)\,\partial f(v) \tag{36b}$$

$$\iff \exists\, \alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ 0 = (\gamma - u)\alpha + f(u) - f(v) + (v - \gamma)\beta \tag{36c}$$

$$\iff \exists\, \alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ \gamma(\alpha - \beta) = u\alpha - f(u) + f(v) - v\beta \tag{36d}$$

$$\iff \exists\, \alpha \in \partial f(u),\ \beta \in \partial f(v) \ \text{s.t.}\ \gamma(\alpha - \beta) = f^*(\alpha) - f^*(\beta). \tag{36e}$$

By varying our choice of $q \in (0, 1)$, the number $\gamma$ can take any positive value.
Similarly, by choosing different positive values of $a, b, c$ (such that $a < b < c$),
we can ensure that $u$ and $v$ can take on any positive real values such that
$u < \gamma < v$. Since equation (35) holds for any $t$, it follows that for any triples
$u < \gamma < v$, (36e) holds for $f_1$ if and only if it also holds for $f_2$.

Considering a fixed pair $u < v$, first suppose that the function $f_1$ is linear
on the interval $[u, v]$ with a slope $s$. In this case, condition (36e) holds for
$f_1$ and any $\gamma$ by choosing $\alpha = \beta = s$, which implies that condition (36e) also
holds for $f_2$ for any $\gamma$. Thus, we deduce that $f_2$ is also a linear function on
the interval $[u, v]$.

Suppose, on the other hand, that $f_1$ and $f_2$ are both nonlinear in $[u, v]$.
Due to the monotonicity of subdifferentials, we have $\partial f_1(u) \cap \partial f_1(v) = \emptyset$
and $\partial f_2(u) \cap \partial f_2(v) = \emptyset$. Consequently, it follows that both $T_{f_1}(u, v)$ and
$T_{f_2}(u, v)$ are non-empty. If $\gamma \in T_{f_1}(u, v)$, then condition (36e) holds for $f_1$
for some $\gamma$. Thus, it must also hold for $f_2$ using the same $\gamma$, which implies
that $\gamma \in T_{f_2}(u, v)$. The same argument can also be applied with the roles of
$f_1$ and $f_2$ reversed, so we conclude that $T_{f_1}(u, v) = T_{f_2}(u, v)$. □